data = [line.strip('\n') for line in file3]
# print(data)
data2 = [line.split(',') for line in data]
data_dictionary = {t[0]:t[1] for t in data2}
print(data_dictionary)
So I'm reading content from a file under the assumption that there is no whitespace at the beginning of each line and no blank lines anywhere.
When I read this file, I first strip the newline character and then split the data on ',', because that is what the data in the file is separated by. But when I make the dictionary it returns two dictionaries instead of one, and it does that for other files where I use this procedure. How do I fix this?
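For reference, the same strip/split/build procedure can be done in a single pass; this is just a sketch, assuming a hypothetical file pairs.txt whose lines each look like key,value with exactly one comma:

import pprint

# build the dictionary in one pass; pairs.txt is a made-up example file
with open('pairs.txt') as f:
    data_dictionary = dict(line.rstrip('\n').split(',') for line in f)
pprint.pprint(data_dictionary)  # prints a single dict, one entry per line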
I am reading from a CSV file and appending the rows into a list. There are some white spaces that are causing issues in my script. I need to remove those white spaces from the list which I have managed to remove. However can someone please advise if this is the right way to do it?
import csv

ip_list = []
with open('name.csv') as open_file:
    read_file = csv.DictReader(open_file)
    for read_rows in read_file:
        ip_list.append(read_rows['column1'])
ip_list = list(filter(None, ip_list))
print(ip_list)
Or would a function be preferable?
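One possible alternative, sketched under the same assumptions (the name.csv file and column1 header from the snippet above): skip empty values as you read instead of filtering afterwards. Note that filter(None, ...) drops empty strings but keeps whitespace-only ones, so stripping first is safer.

import csv

ip_list = []
with open('name.csv') as open_file:
    for row in csv.DictReader(open_file):
        value = row['column1'].strip()  # drop surrounding whitespace
        if value:                       # skip cells that are empty or all spaces
            ip_list.append(value)
print(ip_list)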
Here is a good way to read a csv file and store it in a list.
L = []                        # create an empty list for the main array
for line in open('log.csv'):  # open the file and read all the lines
    x = line.rstrip()         # strip the \n from each line
    L.append(x.split(','))    # split each line into a list and add it to
                              # the multidimensional array
print(L)
For example, this csv file:
This is the first line, Line1
This is the second line, Line2
This is the third line, Line3
would produce output like
L = [['This is the first line', ' Line1'],
     ['This is the second line', ' Line2'],
     ['This is the third line', ' Line3']]
Because csv means comma-separated values, you can split on commas.
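One caveat: a plain split(',') breaks if a field itself contains a quoted comma. The standard csv module handles quoting; a minimal sketch reading the same log.csv:

import csv

# csv.reader honors quoting, so "a, quoted, field" stays one value;
# skipinitialspace drops the blank after each comma
with open('log.csv', newline='') as f:
    L = [row for row in csv.reader(f, skipinitialspace=True)]
print(L)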
I wrote a function in python3 which merges some files in the same directory and returns a csv file as the output, but the problem with the csv file is that I get one extra column at the beginning which has no header, and the rows of that column are numbers starting from 0. How do I write the csv file without getting the extra column?
You can split each line on ',' and then use slicing to remove the first element.
example:
original = """col1,col2,col3
0,val01,val02,val03
1,val11,val12,val13
2,val21,val22,val23
"""
original_lines = original.splitlines()
result = original_lines[:1]  # copy the header row
for line in original_lines[1:]:
    result.append(','.join(line.split(',')[1:]))  # drop the first field
print('\n'.join(result))
Output:
col1,col2,col3
val01,val02,val03
val11,val12,val13
val21,val22,val23
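If the merge is done with pandas (the unnamed leading column of 0, 1, 2, ... looks like a DataFrame index), it may be simpler not to write the column at all. A hedged sketch with made-up file names:

import pandas as pd

# hypothetical merge of two csv files
merged = pd.concat([pd.read_csv(name) for name in ('a.csv', 'b.csv')])
merged.to_csv('merged.csv', index=False)  # index=False suppresses the 0,1,2,... column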
I have a csv file which is not consistent. It looks like this, where some rows have a middle name and some do not. I don't know the best way to fix this. The middle name will always be in the second position if it exists. But if a middle name doesn't exist, the last name is in the second position.
john,doe,52,florida
jane,mary,doe,55,texas
fred,johnson,23,maine
wally,mark,david,44,florida
Let's say that you have ① wrong.csv and want to produce ② fixed.csv.
You want to read a line from ①, fix it, and write the fixed line to ②. This can be done like this:
with open('wrong.csv') as input, open('fixed.csv', 'w') as output:
    for line in input:
        line = fix(line)
        output.write(line)
Now we want to define the fix function...
Each line has either 4 or 5 fields, separated by commas, so what we want to do is split the line using the comma as a delimiter, return the line unmodified if the number of fields is 4, and otherwise join field 0 and field 1 (Python counts from zero...) with a space, reassemble the output line, and return it to the caller.
def fix(line):
    items = line.split(',')  # items is a list of strings
    if len(items) == 4:      # no middle name, the line is OK as it stands
        return line
    # join the first and middle name with a space
    first_middle = ' '.join(items[:2])
    # we want to return a "fixed" line,
    # i.e., a string, not a list of strings,
    # so we have to join the new name with the remaining info
    return ','.join([first_middle] + items[2:])
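As a quick check, applying fix to the sample lines above should give (the expected output is inferred from the stated rule, so treat it as a sketch):

for line in ('john,doe,52,florida\n', 'jane,mary,doe,55,texas\n'):
    print(fix(line), end='')
# expected output:
# john,doe,52,florida
# jane mary,doe,55,texas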
I have tweets saved in JSON text files. I have a friend who wants tweets containing keywords, and the tweets need to be saved in a .csv. Finding the tweets is easy, but I run into two problems and am struggling with finding a good solution.
Sample data are here. I have included the .csv file that is not working as well as a file where each row is a tweet in JSON format.
To get the tweets into a dataframe, I use pd.io.json.json_normalize. It works smoothly and handles nested dictionaries well, but pd.to_csv does not work because, as far as I can tell, it does not handle escape sequences well. Some of the tweets contain '\n' in the text field, and pandas writes real line breaks when that happens.
No problem, I process tweets['text'] to remove '\n'. The resulting file still has too many rows, 1863 compared to the 1338 it should have. I then modified my code to replace all the escape sequences:
tweets['text'] = [item.replace('\n', '') for item in tweets['text']]
tweets['text'] = [item.replace('\r', '') for item in tweets['text']]
tweets['text'] = [item.replace('\\', '') for item in tweets['text']]
tweets['text'] = [item.replace('\'', '') for item in tweets['text']]
tweets['text'] = [item.replace('\"', '') for item in tweets['text']]
tweets['text'] = [item.replace('\a', '') for item in tweets['text']]
tweets['text'] = [item.replace('\b', '') for item in tweets['text']]
tweets['text'] = [item.replace('\f', '') for item in tweets['text']]
tweets['text'] = [item.replace('\t', '') for item in tweets['text']]
tweets['text'] = [item.replace('\v', '') for item in tweets['text']]
Same result: pd.to_csv saves a file with more rows than there are tweets. I could replace escape sequences in all columns, but that is clunky.
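As an aside, a more compact way to do the same cleanup across every string column is a single regex substitution; a sketch only, using the tweets dataframe from above:

import re

# one character class covering all the escapes replaced one-by-one above
pattern = r'[\n\r\\\'"\a\b\f\t\v]'
for col in tweets.select_dtypes(include='object'):
    tweets[col] = tweets[col].str.replace(pattern, '', regex=True)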
Fine, don't use pandas. with open(outpath, 'w') as f: and so on creates a .csv file with the correct number of rows. Reading the file back, however, either with pd.read_csv or line by line, fails.
It fails because of how Twitter handles entities. If a tweet's text contains a url, mention, hashtag, media, or link, then Twitter returns a dictionary that contains commas. When pandas flattens the tweet, the commas get preserved within a column, which is good. But when the data are read back in, pandas splits what should be one column into multiple columns. For example, a column might hold [{'screen_name': 'ProfOsinbajo', 'name': 'Prof Yemi Osinbajo', 'id': 2914442873, 'id_str': '2914442873', 'indices': [0, 13]}], so splitting on commas creates too many columns:
[{'screen_name': 'ProfOsinbajo',
 'name': 'Prof Yemi Osinbajo',
 'id': 2914442873,
 'id_str': '2914442873',
 'indices': [0,
 13]}]
That is the outcome when I use with open(outpath) as f: as well. With that approach I have to split the lines myself, so I split on commas. Same problem: I do not want to split on commas when they occur inside a list.
I want those data to be treated as one column when saved to file or read from file. What am I missing? In terms of the data at the repository above, I want to convert forstackoverflow2.txt to a .csv with as many rows as tweets. Call this file A.csv, and let's say it has 100 columns. When opened, A.csv should also have 100 columns.
I'm sure there are details I've left out, so please let me know.
Using the csv module works. It writes the file out as a .csv while counting the lines, then reads it back in and counts the lines again.
The counts matched, and opening the .csv in Excel also shows 191 columns and 1338 lines of data.
import json
import csv
with open('forstackoverflow2.txt') as f,\
     open('out.csv','w',encoding='utf-8-sig',newline='') as out:
    data = json.loads(next(f))
    print('columns',len(data))
    writer = csv.DictWriter(out,fieldnames=sorted(data))
    writer.writeheader()   # write header
    writer.writerow(data)  # write the first line of data
    for i,line in enumerate(f,2):  # start line count at two
        data = json.loads(line)
        writer.writerow(data)
    print('lines',i)

with open('out.csv',encoding='utf-8-sig',newline='') as f:
    r = csv.DictReader(f)
    lines = list(r)
    print('readback columns',len(lines[0]))
    print('readback lines',len(lines))
Output:
columns 191
lines 1338
readback columns 191
readback lines 1338
@Mark Tolonen's answer is helpful, but I ended up going a separate route. When saving the tweets to file, I removed all \r, \n, \t, and \0 characters from anywhere in the JSON. Then I saved the file as tab-separated, so that commas in fields like location or text do not confuse a read function.
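A minimal sketch of that route, assuming the tweets dataframe from earlier; the characters removed and the separator match the description above:

# strip the offending control characters everywhere, then write
# tab-separated so embedded commas cannot split a field
for col in tweets.select_dtypes(include='object'):
    tweets[col] = tweets[col].str.replace(r'[\r\n\t\0]', '', regex=True)
tweets.to_csv('tweets.tsv', sep='\t', index=False)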
I need to read in a file, then strip the lines of the file, then split the values on each line, and finally write out to a new file. Essentially, when I split the lines, all the values will be strings, and once they have been split each line will be its own list! The code I have written is still just copying the text and pasting it to the new file without stripping or splitting the values!
with open(data_file) as data:
    next(data)
    for line in data:
        line.rstrip
        line.split
        output.write(line)
logging.info("Successfully added lines")
str.rstrip and str.split return new objects rather than modifying line in place, so the results must be assigned; and file.write accepts only strings, so the split values have to be joined back into one line before writing:

# assumes `output` is an open, writable file object, as in the question
with open(data_file) as data:
    next(data)  # Are you sure you want this? It essentially throws away the
                # first line of the data file
    for line in data:
        fields = line.strip().split()  # strip, then split on whitespace
        # write() needs a string, not a list; joining with commas here is
        # just one choice of output format
        output.write(','.join(fields) + '\n')
logging.info("Successfully added lines")