I have a CSV file containing 60,000 entries. I read them and store them in a nested list like this:
import csv

entries = []
with open('mnist_train.csv', 'r') as f:
    mycsv = csv.reader(f)
    for row in mycsv:
        entries.append(row)
Instead of reading all 60,000, how would I read only the first thousand entries?
I tried this without success:
entries = []
with open('mnist_train.csv', 'r') as f:
    mycsv = csv.reader(f)
    for row in mycsv[:1000]:
        entries.append(row)
As you've discovered, a csv.reader object does not support slicing. You can use itertools.islice() to take the first N items of any iterable. E.g.,
import csv
import itertools

entries = []
with open('mnist_train.csv', 'r') as f:
    mycsv = csv.reader(f)
    for row in itertools.islice(mycsv, 1000):
        entries.append(row)
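islice() also accepts optional start and stop arguments, so if your file has a header row you want to skip, something along these lines should work (assuming the first line is the header):

# islice(iterable, 1, 1001): skip the header row, then take the next 1000 data rows
for row in itertools.islice(mycsv, 1, 1001):
    entries.append(row)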
You can use the pandas library:
import pandas as pd

data = pd.read_csv('path/to/your/file.csv', nrows=1000)
data_list = data.values.tolist()  # creates a list of the first 1000 rows (excludes the header)
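If your CSV has no header row, you can pass header=None so pandas does not treat the first data line as column names:

# header=None: the first line is data, not column names
data = pd.read_csv('path/to/your/file.csv', nrows=1000, header=None)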
I have to read a CSV file N lines at a time.
csv_reader = csv.reader(csv_file, delimiter=',')
line_count = 0
for row in csv_reader:
    print(row)
I know I can loop N times, build a list of lists, and process it that way.
But is there a simpler way of using csv_reader so that I read N lines at a time?
I don't think you'll be able to do that without a loop using the csv package.
You should use pandas (pip install --user pandas) instead:
import pandas

df = pandas.read_csv('myfile.csv')

step = 2  # Your 'N'
for i in range(0, len(df), step):
    print(df[i:i+step])
Pandas has a chunksize option to its read_csv() method, and I would probably explore that option.
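As a rough sketch of that option (the file name and chunk size here are placeholders), read_csv() returns an iterator of DataFrames when chunksize is set:

import pandas as pd

# each iteration yields a DataFrame with up to `chunksize` rows
for chunk in pd.read_csv("data.csv", chunksize=5):
    print(chunk.values.tolist())  # or hand the batch to your own processing function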
If I was going to do it myself by hand, I would probably do something like:
import csv

def process_batch(rows):
    print(rows)

def get_batch(reader, batch_size):
    # pull up to batch_size rows from the reader; stops early when the reader is exhausted
    return [row for _ in range(batch_size) if (row := next(reader, None)) is not None]

with open("data.csv", "r") as file_in:
    reader = csv.reader(file_in)
    while batch := get_batch(reader, 5):
        process_batch(batch)
I'm using pandas to open a CSV file that contains data from Spotify; meanwhile, I have a txt file that contains various artist names from that CSV file. What I'm trying to do is get the value from each row of the txt file and automatically search for it in the function I've written.
import pandas as pd
import time

df = pd.read_csv("data.csv")
df = df[['artists', 'name', 'year']]

def buscarA():
    start = time.time()
    newdf = df.loc[df['artists'].str.contains(art)]
    stop = time.time()
    tempo = stop - start
    print(newdf)
    e = '{:.2f}'.format(tempo)
    print(e)

with open("teste3.txt", "r") as f:
    for row in f:
        art = row
        buscarA()
but the output is always the same:
Empty DataFrame
Columns: [artists, name, year]
Index: []
The problem here is that when you read the lines of your file in Python, each line keeps its trailing line break, so you have to strip it off.
Let's suppose that the first line of your teste3.txt file is "James Brown". It'd be read as "James Brown\n" and not recognized in the search.
Changing the last chunk of your code to:
with open("teste3.txt", "r") as f:
for row in f:
art = row.strip()
buscarA()
should work.
Requirement: Read a large CSV file (>1 million rows) in chunks.
Issue: Sometimes the generator yields the same set of rows twice even though the file has unique rows, but on some runs it looks fine with no duplicates.
It looks like I am missing something in the code, but I am not able to figure out what.
I want to make sure it doesn't yield the same object over and over with different contents.
Code:
def gen_chunks(self, reader, chunksize=100000):
    chunk = []
    for i, line in enumerate(reader):
        if i % chunksize == 0 and i > 0:
            yield list(map(tuple, chunk))
            chunk = []
        chunk.append(line)
    yield list(map(tuple, chunk))
def execute(self, context):
    with tempfile.NamedTemporaryFile() as f_source:
        s3_client.download_file(self.s3_bucket, self.s3_key, f_source.name)
        with open(f_source.name, 'r') as f:
            csv_reader = csv.reader(f, delimiter='|')
            for chunk in self.gen_chunks(csv_reader):
                logger.info('starting in chunk process')
                orcl.bulk_insert_rows(table=self.oracle_table, rows=chunk,
                                      target_fields=self.target_fields, commit_every=10000)
I don't know if you have the option to try pandas, but if so, this could be your answer.
I find pandas faster when working with millions of records in a CSV; here is some code that will help you:
import pandas as pd

chunks = pd.read_csv(f_source.name, delimiter="|", chunksize=100000)
for chunk in chunks:
    for row in chunk.values:
        print(row)
pandas provides a lot of options with read_csv:
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html
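For example, options such as usecols and dtype can cut down the work done per chunk; a small sketch with placeholder column names:

# only parse the columns you need, with explicit dtypes, while still chunking
chunks = pd.read_csv(f_source.name, delimiter="|", chunksize=100000,
                     usecols=["id", "name"], dtype={"id": "int64", "name": "string"})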
I am looking for a comprehension to read a CSV and create a dictionary where the key is a string and the value is a list.
The CSV looks like:
fruit,Apple
vegetable,Onion
fruit,Banana
fruit,Mango
vegetable,Potato
My output should look like:
{'fruit':['Apple','Banana','Mango'],'vegetable':['Onion','Potato']}
I am looking for a dictionary comprehension to do that. I tried:
def readCsv(filename):
    with open(filename) as csvfile:
        readCSV = csv.reader(csvfile, delimiter='\t')
        dicttest = {row[1]: [].append(row[2]) for row in readCSV}
    return dicttest
Is this what you are trying to achieve?
import csv

def readCsv(filename):
    d = {}
    with open(filename) as csvfile:
        readCSV = csv.reader(csvfile, delimiter=',')  # the sample data shown is comma-separated
        for row in readCSV:
            d.setdefault(row[0], []).append(row[1])
    return d

print(readCsv('test.csv'))
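As an aside, the comprehension in the question produces None values because list.append() returns None, and a plain dict comprehension keeps only the last value per key anyway. A grouping comprehension is possible, though the loop above is clearer; a rough sketch against the comma-separated sample file:

import csv

with open('test.csv') as csvfile:
    rows = list(csv.reader(csvfile))

# O(n^2), but purely comprehension-based: rebuild the value list for every key
dicttest = {key: [v for k, v in rows if k == key] for key, _ in rows}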
I am trying to read a CSV and then transpose one column into a row.
I tried following a tutorial for reading a CSV and then one for writing, but the data doesn't stay saved to the list when I try to write the row.
import csv

f = open('bond-dist-rep.csv')
csv_f = csv.reader(f)

bondlength = []

with open("bond-dist-rep.csv") as f:
    for row in csv_f:
        bondlength.append(row[1])

print(bondlength)
print(len(bondlength))

with open('joined.csv', 'w', newline='') as csvfile:
    csv_a = csv.writer(csvfile, delimiter=',', quotechar='"',
                       quoting=csv.QUOTE_ALL)
    csv_a.writerow(['bondlength'])

with open('joined.csv') as csvfile:
    readCSV = csv.reader(csvfile, delimiter=',')
    for row in readCSV:
        print(row)
        print(row[0])

f.close()
The issue is that you read only one value from each line and then write only a single header string to the new file.
In order to transpose the read lines, you can use the zip function.
I also removed the first open() call, which is unnecessary because with already opens the file properly.
Here is the final code:
import csv

bondlength = []

with open("bond-dist-rep.csv") as csv_f:
    read_csv = csv.reader(csv_f)
    for row in read_csv:
        bondlength.append(row)

# delete the header if you have one
bondlength.pop(0)

with open('joined.csv', 'w', newline='') as csvfile:
    csv_a = csv.writer(csvfile, delimiter=',')
    # zip(*rows) pairs up the i-th element of every row, turning columns into rows
    for transpose_row in zip(*bondlength):
        csv_a.writerow(transpose_row)
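To illustrate what zip(*...) does here, a small example independent of the file above:

rows = [['a', '1'], ['b', '2'], ['c', '3']]
print(list(zip(*rows)))  # [('a', 'b', 'c'), ('1', '2', '3')] - columns become rows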