Ive a csv file that I would like to get all the rows in one column. Ive tried importing into MS Excel or Formatting it with Notedpad++ . However with each try it considers a piece of data as a new row.
How can I format file with pythons csv module so that it removes a string "BRAS" and corrects the format. Each row is found between a quote " and delimiter is a pipe |.
Update:
"aa|bb|cc|dd|
ee|ff"
"ba|bc|bd|be|
bf"
"ca|cb|cd|
ce|cf"
The above is supposed to be 3 rows, however my editors see them as 5 rows or 6 and so forth.
import csv
import fileinput
with open('ventoya.csv') as f, open('ventoya2.csv', 'w') as w:
for line in f:
if 'BRAS' not in line:
w.write(line)
N.B I get a unicode error when trying to use in python.
return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x8f in position 18: character maps to <undefined>
This is a quick hack for small input files (the content is read to memory).
#!python2
fnameIn = 'ventoya.csv'
fnameOut = 'ventoya2.csv'
with open(fnameIn) as fin, open(fnameOut, 'w') as fout:
data = fin.read() # content of the input file
data = data.replace('\n', '') # make it one line
data = data.replace('""', '|') # split char instead of doubled ""
data = data.replace('"', '') # remove the first and last "
print data
for x in data.split('|'): # split by bar
fout.write(x + '\n') # write to separate lines
Or if the goal is only to fix the extra (unwanted) newline to form a single-column CSV file, the file can be fixed first, and then read through the csv module:
#!python2
import csv
fnameIn = 'ventoya.csv'
fnameFixed = 'ventoyaFixed.csv'
fnameOut = 'ventoya2.csv'
# Fix the input file.
with open(fnameIn) as fin, open(fnameFixed, 'w') as fout:
data = fin.read() # content of the file
data = data.replace('\n', '') # remove the newlines
data = data.replace('""', '"\n"') # add the newlines back between the cells
fout.write(data)
# It is an overkill, but now the fixed file can be read using
# the csv module.
with open(fnameFixed, 'rb') as fin, open(fnameOut, 'wb') as fout:
reader = csv.reader(fin)
writer = csv.writer(fout)
for row in reader:
writer.writerow(row)
For solving this you need not to go to even code.
1: Just open file in Notepad++
2: In first line select from | symble till next line
3: go to replace and replace the selected format with |
Search mode can be normal or extended :)
Well, since the line breaks are consistent, you could go in and do find/replace as suggested, but you could also do a quick conversion with your python script:
import csv
import fileinput
linecount = 0
with open('ventoya.csv') as f, open('ventoya2.csv', 'w') as w:
for line in f:
line = line.rstrip()
# remove unwanted breaks by concatenating pairs of rows
if linecount%2 == 0:
line1 = line
else:
full_line = line1 + line
full_line = full_line.replace(' ','')
# remove spaces from front of 2nd half of line
# if you want comma delimiters, uncomment next line:
# full_line = full_line.replace('|',',')
if 'BRAS' not in full_line:
w.write(full_line + '\n')
linecount += 1
This works for me with the test data, and if you want to change the delimiters while writing to file, you can. The nice thing about doing with code is: 1. you can do it with code (always fun) and 2. you can remove the line breaks and filter content to the written file at the same time.
Related
I am trying to read some data files '.txt' and some of them contain strange random characters and even extra columns in random rows, like in the following example, where the second row is an example of a right row:
CTD 10/07/30 05:17:14.41 CTD 24.7813, 0.15752, 1.168, 0.7954, 1497.¸ 23.4848, 0.63042, 1.047, 3.5468, 1496.542
CTD 10/07/30 05:17:14.47 CTD 23.4846, 0.62156, 1.063, 3.4935, 1496.482
I read the description of np.loadtxt and I have not found a solution for my problem. Is there a systematic way to skip rows like these?
The code that I use to read the files is:
#Function to read a datafile
def Read(filename):
#Change delimiters for spaces
s = open(filename).read().replace(':',' ')
s = s.replace(',',' ')
s = s.replace('/',' ')
#Take the columns that we need
data=np.loadtxt(StringIO(s),usecols=(4,5,6,8,9,10,11,12))
return data
This works without using csv like the other answer and just reads line by line checking if it is ascii
data = []
def isascii(s):
return len(s) == len(s.encode())
with open("test.txt", "r") as fil:
for line in fil:
res = map(isascii, line)
if all(res):
data.append(line)
print(data)
You could use the csv module to read the file one line at a time and apply your desired filter.
import csv
def isascii(s):
len(s) == len(s.encode())
with open('file.csv') as csvfile:
csvreader = csv.reader(csvfile)
for row in csvreader:
if len(row)==expected_length and all((isascii(x) for x in row)):
'write row onto numpy array'
I got the ascii check from this thread
How to check if a string in Python is in ASCII?
I have a csv file which contains data in two columns, as follows:
40500 38921
43782 32768
55136 49651
63451 60669
50550 36700
61651 34321
and so on...
I want to convert each data into it's hex equivalent, then concatenate them, and write them into a column in another csv file.
For example: hex(40500) = 9E34, and hex(38921) = 9809.
So, in output csv file, element A1 would be 9E349809
So, i am expecting column A in output csv file to be:
9E349809
AB068000
D760C1F3
F7DBECFD
C5768F5C
F0D38611
I referred a sample code which concatenates two columns, but am struggling with the converting them to hex and then concatenating them. Following is the code:-
import csv
inputFile = 'input.csv'
outputFile = 'output.csv'
with open(inputFile) as f:
reader = csv.reader(f)
with open(outputFile, 'w') as g:
writer = csv.writer(g)
for row in reader:
new_row = [''.join([row[0], row[1]])] + row[2:]
writer.writerow(new_row)
How can i convert data in each column to its hex equivalent, then concatenate them and write them in another file?
You could do this in 4 steps:
Read the lines from the input csv file
Use formatting options to get the hex values of each number
Perform string concatenation to get your result
Write to new csv file.
Sample Code:
with open (outputFile, 'w') as outfile:
with open (inputFile,'r') as infile:
for line in infile: # Iterate through each line
left, right = int(line.split()[0]), int(line.split()[1]) # split left and right blocks
newstr = '{:x}'.format(left)+'{:x}'.format(right) # create new string using hex values excluding '0x'
outfile.write(newstr) # write to output file
print ('Conversion completed')
print ('Closing outputfile')
Sample Output:
In[44] line = '40500 38921'
Out[50]: '9e349809'
ParvBanks solution is good (clear and functionnal), I would simplify it a little like that:
with open (inputFile,'r') as infile, open (outputFile, 'w+') as outfile:
for line in infile:
outfile.write("".join(["{:x}".format(int(v)) for v in line.split()]))
I have a text file say storyfile.txt
Content in storyfile.txt is as
'Twas brillig, and the slithy toves
Did gyre and gimble in the wabe;
All mimsy were the borogoves,
And the mome raths outgrabe
I have another file- hashfile.txt that contains some words separated by comma(,)
Content of hashfile.txt is:
All,mimsy,were,the,borogoves,raths,outgrabe
My objective
My objective is to
1. Read hashfile.txt
2. Insert Hashtag on each of the comma separated word
3. Read storyfile.txt . Search for same words as in hashtag.txt and add hashtag on these words.
4. Update storyfile.txt with words that are hash-tagged
My Python code so far
import in_place
hashfile = open('hashfile.txt', 'w+')
n1 = hashfile.read().rstrip('\n')
print(n1)
checkWords = n1.split(',')
print(checkWords)
repWords = ["#"+i for i in checkWords]
print(repWords)
hashfile.close()
with in_place.InPlace('storyfile.txt') as file:
for line in file:
for check, rep in zip(checkWords, repWords):
line = line.replace(check, rep)
file.write(line)
The output
can be seen here
https://dpaste.de/Yp35
Why is this kind of output is coming?
Why the last sentence has no newlines in it?
Where I am wrong?
The output
attached image
The current working code for single text
import in_place
with in_place.InPlace('somefile.txt') as file:
for line in file:
line = line.replace('mome', 'testZ')
file.write(line)
Look if this helps. This fulfills the objective that you mentioned, though I have not used the in_place module.
hash_list = []
with open("hashfile.txt", 'r') as f:
for i in f.readlines():
for j in i.split(","):
hash_list.append(j.strip())
with open("storyfile.txt", "r") as f:
for i in f.readlines():
for j in hash_list:
i = i.replace(j, "#"+j)
print(i)
Let me know if you require further clarification on the same.
I'm using Windows 7 and Python 3.4.
I have several multi-line text files (all in Persian) and I want to merge them into one under one condition: each line of the output file must contain the whole text of each input file. It means if there are nine text files, the output text file must have only nine lines, each line containing the text of a single file. I wrote this:
import os
os.chdir ('C:\Dir')
with open ('test.txt', 'w', encoding = 'UTF8') as OutFile:
with open ('news01.txt', 'r', encoding = 'UTF8') as InFile:
while True:
_Line = InFile.readline()
if len (_Line) == 0:
break
else:
_LineString = str (_Line)
OutFile.write (_LineString)
It worked for that one file but it looks like it takes more than one line in output file and also the output file contains disturbing characters like: &,   and things like that. But the source files don't contain any of them.
Also, I've got some other texts: news02.txt, news03.txt, news04.txt ... news09.txt.
Considering all these:
How can I correct my code so that it reads all files one after one, putting each in only one line?
How can I clean these unfamiliar and strange characters or prevent them to appear in my final text?
Here is an example that will do the merging portion of your question:
def merge_file(infile, outfile, separator = ""):
print(separator.join(line.strip("\n") for line in infile), file = outfile)
def merge_files(paths, outpath, separator = ""):
with open(outpath, 'w') as outfile:
for path in paths:
with open(path) as infile:
merge_file(infile, outfile, separator)
Example use:
merge_files(["C:\file1.txt", "C:\file2.txt"], "C:\output.txt")
Note this makes the rather large assumption that the contents of 'infile' can fit into memory. Reasonable for most text files, but possibly quite unreasonable otherwise. If your text files will be very large, you can this alternate merge_file implementation:
def merge_file(infile, outfile, separator = ""):
for line in infile:
outfile.write(line.strip("\n")+separator)
outfile.write("\n")
It's slower, but shouldn't run into memory problems.
Answering question 1:
You were right about the UTF-8 part.
You probably want to create a function which takes multiple files as a tuple of files/strings of file directories or *args. Then, read all input files, and replace all "\n" (newlines) with a delimiter (Default ""). out_file can be in in_files, but makes the assumption that the contents of files can be loaded in to memory. Also, out_file can be a file object, and in_files can be file objects.
def write_from_files(out_file, in_files, delimiter="", dir="C:\Dir"):
import _io
import os
import html.parser # See part 2 of answer
os.chdir(dir)
output = []
for file in in_files:
file_ = file
if not isinstance(file_, _io.TextIOWrapper):
file_ = open(file_, "r", -1, "UTF-8") # If it isn't a file, make it a file
file_.seek(0, 0)
output.append(file_.read().replace("\n", delimiter)) # Replace all newlines
file_.close() # Close file to prevent IO errors # with delimiter
if not isinstance(out_file, _io.TextIOWrapper):
out_file = open(out_file, "w", -1, "UTF-8")
html.parser.HTMLParser().unescape("\n".join(output))
out_file.write(join)
out_file.close()
return join # Do not have to return
Answering question 2:
I think you may of copied from a webpage. This does not happen to me. The & and   are the HTML entities, (&) and ( ). You may need to replace them with their corresponding character. I would use HTML.parser. As you see in above, it turns HTML escape sequences into Unicode literals. E.g.:
>>> html.parser.HTMLParser().unescape("Alpha < β")
'Alpha < β'
This will not work in Python 2.x, as in 3.x it was renamed. Instead, replace the incorrect lines with:
import HTMLParser
HTMLParser.HTMLParser().unescape("\n".join(output))
Can you explain what is going on in this code? I don't seem to understand
how you can open the file and read it line by line instead of all of the sentences at the same time in a for loop. Thanks
Let's say I have these sentences in a document file:
cat:dog:mice
cat1:dog1:mice1
cat2:dog2:mice2
cat3:dog3:mice3
Here is the code:
from sys import argv
filename = input("Please enter the name of a file: ")
f = open(filename,'r')
d1ct = dict()
print("Number of times each animal visited each station:")
print("Animal Id Station 1 Station 2")
for line in f:
if '\n' == line[-1]:
line = line[:-1]
(AnimalId, Timestamp, StationId,) = line.split(':')
key = (AnimalId,StationId,)
if key not in d1ct:
d1ct[key] = 0
d1ct[key] += 1
The magic is at:
for line in f:
if '\n' == line[-1]:
line = line[:-1]
Python file objects are special in that they can be iterated over in a for loop. On each iteration, it retrieves the next line of the file. Because it includes the last character in the line, which could be a newline, it's often useful to check and remove the last character.
As Moshe wrote, open file objects can be iterated. Only, they are not of the file type in Python 3.x (as they were in Python 2.x). If the file object is opened in text mode, then the unit of iteration is one text line including the \n.
You can use line = line.rstrip() to remove the \n plus the trailing withespaces.
If you want to read the content of the file at once (into a multiline string), you can use content = f.read().
There is a minor bug in the code. The open file should always be closed. I means to use f.close() after the for loop. Or you can wrap the open to the newer with construct that will close the file for you -- I suggest to get used to the later approach.