How to split csv file at some specific rows index (array of indexes) - python-3.x

I am a beginner in Python. I have a CSV file with about 3 million rows, and I want to split it into several CSV files based on a list of row indexes.
For example, the list of indexes where the splits should happen looks like this:
Index [0 , 1000 , 5000 , ... etc ]
The first CSV file should contain rows 0 to 1000, the second CSV file rows 1001 to 5000, and so on.
I tried the code below, but it doesn't work. I think the problem is in the loop:
with open('input_0.csv', 'r') as f:
    csvfile = f.readlines()
filename = 1
for i in range(0,len(csvfile)):
    with open(str(filename) + '.csv', 'w+') as f:
        if filename > 1:
            f.write(csvfile[0]) # header again
        f.writelines(csvfile[index[i]:1+index[i]])
    filename += 1
Thanks in advance
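A possible way to do the split, as a minimal sketch rather than a definitive answer. It assumes the first line of the file is a header that should be repeated in every output file, that the boundaries are sorted, that each chunk is half-open (boundaries [0, 1000, 5000] give data rows 0-999, 1000-4999, and 5000 to the end), and that no field contains an embedded newline. The function name split_csv and the numbered output names are made up for illustration. The file is streamed with itertools.islice, so the 3 million rows are never held in memory at once.

```python
from itertools import islice
from pathlib import Path


def split_csv(src, boundaries, out_dir="."):
    """Split src into numbered CSV files at the given data-row boundaries."""
    if not boundaries:
        return []
    out_dir = Path(out_dir)
    written = []
    with open(src, newline="") as f:
        header = f.readline()
        # Skip any data rows that come before the first boundary.
        for _ in islice(f, boundaries[0]):
            pass
        # Pair every boundary with the next one; the last chunk runs to EOF.
        stops = list(boundaries[1:]) + [None]
        for part, (start, stop) in enumerate(zip(boundaries, stops), start=1):
            count = None if stop is None else stop - start
            chunk = list(islice(f, count))
            if not chunk:
                break  # ran out of rows before using every boundary
            path = out_dir / f"{part}.csv"
            with open(path, "w", newline="") as out:
                out.write(header)  # repeat the header in every part
                out.writelines(chunk)
            written.append(path)
    return written
```

Called as split_csv('input_0.csv', [0, 1000, 5000]), this would write 1.csv, 2.csv and 3.csv into the current folder. If the ranges should be inclusive as described in the question, the boundaries can be shifted by one before calling it.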

Related

Extract numbers only from specific lines within a txt file with certain keywords in Python

I want to extract numbers only from the lines in a txt file that contain a certain keyword, add them up per line, compare the totals, and then print the highest and the lowest total. How should I go about this?
I want to print the highest and the lowest valid totals.
I managed to extract the lines that contain the "valid" keyword. Now I want to get the numbers from each of these lines, add them up per line, compare the result with the other lines that have the same keyword, and print the highest and the lowest valid totals.
My code so far:
# read the content of the file into a string; the with block closes the file
with open("shelfs.txt", "r") as file:
    data = file.read()
# get the number of occurrences of each keyword in the string
totalshelfs = data.count("SHELF")
# include the delimiters so that "INVALID" is not also counted as "VALID"
totalvalid = data.count("|VALID|")
totalinvalid = data.count("|INVALID|")
print('Number of total shelfs :', totalshelfs)
print('Number of valid books :', totalvalid)
print('Number of invalid books :', totalinvalid)
txt file
HEADER|
SHELF|2200019605568|
BOOK|20200120000000|4810.1|20210402|VALID|
SHELF|1591024987400|
BOOK|20200215000000|29310.0|20210401|VALID|
SHELF|1300001188124|
BOOK|20200229000000|11519.0|20210401|VALID|
SHELF|1300001188124|
BOOK|20200329001234|115.0|20210331|INVALID|
SHELF|1300001188124|
BOOK|2020032904567|1144.0|20210401|INVALID|
FOOTER|
You can use the pandas library for this.
https://pandas.pydata.org/
You can read a delimited text file like this:
import pandas as pd
data = pd.read_csv('shelfs.txt', sep='|')
It returns a DataFrame object, which makes it easy to select or sort your data. It uses the first row as the header, and you can then select a specific column as you would with a dictionary:
header = data['HEADER']
header is a Series object.
To select rows you can do:
shelfs = data.loc[data['HEADER']=='SHELF']
which keeps only the rows where the HEADER column is 'SHELF'.
I'm just not sure how pandas will handle the fact that the header row has only 1 column while the other rows have 2 or 5 columns.
It may be easier to first give every column its own header and add separators so that each row has the same number of fields.
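One way around the uneven row lengths, sketched here under a few assumptions, is to pass explicit column names to read_csv so that shorter rows are padded with NaN. The column names below are hypothetical (the file does not label its fields), and adding up all three numeric fields per row is only one reading of the question. The sample data is a shortened copy of the file from the question.

```python
import io

import pandas as pd

# Hypothetical column names: the file itself does not label its fields.
COLUMNS = ["record", "code", "amount", "date", "status", "trailing"]


def valid_totals(source):
    """Return (lowest, highest) per-row totals over the VALID rows."""
    # Explicit names let pandas pad the shorter rows with NaN.
    df = pd.read_csv(source, sep="|", header=None, names=COLUMNS)
    valid = df[df["status"] == "VALID"]
    # Which fields to add up is an assumption; here all three numeric ones.
    totals = valid[["code", "amount", "date"]].sum(axis=1)
    return totals.min(), totals.max()


sample = io.StringIO(
    "HEADER|\n"
    "SHELF|2200019605568|\n"
    "BOOK|20200120000000|4810.1|20210402|VALID|\n"
    "SHELF|1591024987400|\n"
    "BOOK|20200215000000|29310.0|20210401|VALID|\n"
    "BOOK|20200329001234|115.0|20210331|INVALID|\n"
    "FOOTER|\n"
)
lowest, highest = valid_totals(sample)
print(lowest, highest)
```

For the real file, valid_totals('shelfs.txt') would be the equivalent call. If only the amount field matters, the list passed to the sum can be reduced to ["amount"].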
Edit (no external libraries and no change to the txt file):
# Split by row, dropping any literal '<br>' markers at the end of a line
data = [line.replace('<br>', '') for line in data.splitlines()]
# Split by column
data = [d.split('|') for d in data]
# Fill empty cells so that every row has the same number of columns
n_cols = max(len(d) for d in data)
for d in data:
    d.extend([''] * (n_cols - len(d)))
# List VALID rows
valid_rows = [d for d in data if d[4]=='VALID']
# Get the sum per row (converted to numbers first), then min and max
valid_sum = [float(d[1])+float(d[2])+float(d[3]) for d in valid_rows]
valid_minimum = min(valid_sum)
valid_maximum = max(valid_sum)
It may not be exactly what you want to do, but it solves part of your problem. I didn't test the code.

Python Ignoring few x number of lines in CSV file

So I'm trying to read a tab-delimited CSV file that is saved as a .DAT file.
I need to skip everything until I get to the data portion under the Date header, that is, 05-29-2012, everything on that row, and the rows below it. I found plenty of documentation on how to skip the first few lines, but I don't know how many lines that will be. The number of text lines between "Data file created" and the meat of the data varies from file to file: it could be 3 rows, it could be 10 rows.
I have thousands of these files that I'm trying to extract the data from and plot. It's easy in Excel to just cut and paste, but I'm going for efficiency here.
This is the code I'm using. It reads the data perfectly, but only if I know how many lines to skip. There will be blank lines, and I know how to bypass those, but any text there adds extra lines that I can't bypass.
import pandas as pd
import csv
myfile = ('E:\\TTF Data Backup\\1X ARRAY MOD #2.dat')
df = pd.read_csv(myfile, skiprows=3 , delimiter='\t')
print(df.head(20))
Data File
Try this code:
import pandas as pd
myfile = r'E:\TTF Data Backup\1X ARRAY MOD #2.dat'  # raw string, so the backslashes are kept literally
skipcnt = 0
with open(myfile) as f:  # the file is closed automatically after the block
    for row in f:
        skipcnt += 1
        if "Tension" in row and "Elong" in row:  # top of header
            break
skipcnt += 3  # skip headers
df = pd.read_csv(myfile, skiprows=skipcnt, delimiter='\t')
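The same idea can be wrapped in a small helper, shown here as a sketch. It assumes the column header line is the first line whose first tab-separated field is exactly "Date", as the question describes. The function name and its parameters are made up for illustration.

```python
def count_lines_before(path, marker="Date", delimiter="\t"):
    """Return how many lines come before the first line whose first field is marker."""
    with open(path) as f:
        for lineno, line in enumerate(f):
            first_field = line.split(delimiter, 1)[0].strip()
            if first_field == marker:
                return lineno
    raise ValueError(f"no line starting with {marker!r} in {path}")


# Possible use with pandas (not run here):
#   n = count_lines_before(myfile)
#   df = pd.read_csv(myfile, skiprows=n, delimiter="\t")
```

Because skiprows=n drops exactly the lines before the header, read_csv then uses the "Date" line as the column header, however many lines of free text precede it. Raising an error when the marker is missing makes a malformed file easy to spot when thousands of files are processed in a loop.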

How to combine multiple csv files based on file name

I have more than 1000 CSV files, and I want to combine the files whose names share the same first five digits into one CSV file.
input:
100044566.csv
100040457.csv
100041458.csv
100034566.csv
100030457.csv
100031458.csv
100031459.csv
import pandas as pd
import os
import glob
path_1 =''
all_files_final = glob.glob(os.path.join(path_1, "*.csv"))
names_1 = [os.path.basename(x1) for x1 in all_files_final]
final = pd.DataFrame()
for file_1, name_1 in zip(all_files_final, names_1):
    file_df_final = pd.read_csv(file_1,index_col=False)
    #file_df['file_name'] = name
    final = final.append(file_df_final)
final.to_csv('',index=False)
I used the above code, but it merges all the files into one CSV file. I don't know how to make the selection based on the name.
So, from the input above:
output 1: combine the first three CSV files into one CSV file, because the first five digits of their filenames are the same.
output 2: combine the next 4 files into one CSV file, because the first five digits of their filenames are the same.
I would recommend approaching the problem slightly differently.
Here's my solution:
import os
import pandas as pd
files = os.listdir('.')  # list of filenames in the current folder
files_of_interest = {}  # maps each five-character prefix to its files
for filename in files:  # iterate over the files in the folder
    if filename.endswith('.csv'):  # only consider .csv files
        key = filename[:5]  # the first five characters of the filename, as mentioned in the question
        # setdefault creates the key with an empty list if it is missing; then the filename is appended
        files_of_interest.setdefault(key, []).append(filename)
for key in files_of_interest:
    # read every file for this key and concatenate them
    # (DataFrame.append was removed in pandas 2.0, so pd.concat is used instead)
    frames = [pd.read_csv(filename) for filename in files_of_interest[key]]
    files_of_interest[key] = pd.concat(frames)  # replace the list of files with a single data frame
This code builds a dictionary of data frames whose keys are the distinct five-character filename prefixes.
You can then iterate over the keys of the dictionary and save each data frame as a .csv file.
Hope my answer helped.
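For comparison, here is a sketch that does the grouping and the saving in one pass with only the standard library instead of pandas. It assumes every file in a group has the same header row, and it writes that header once per merged file. The function name and the "_merged.csv" output naming are made up for illustration.

```python
import csv
from collections import defaultdict
from pathlib import Path


def merge_by_prefix(folder, out_dir, prefix_len=5):
    """Merge CSV files whose names share the first prefix_len characters."""
    groups = defaultdict(list)
    for path in sorted(Path(folder).glob("*.csv")):
        groups[path.name[:prefix_len]].append(path)

    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    for prefix, paths in groups.items():
        target = out_dir / f"{prefix}_merged.csv"
        with open(target, "w", newline="") as out:
            writer = csv.writer(out)
            for i, path in enumerate(paths):
                with open(path, newline="") as src:
                    reader = csv.reader(src)
                    header = next(reader, None)
                    if i == 0 and header is not None:
                        writer.writerow(header)  # write the header only once
                    writer.writerows(reader)
    # Report which input files ended up in which group.
    return {prefix: [p.name for p in paths] for prefix, paths in groups.items()}
```

Writing the merged files to a separate folder keeps them from being picked up as inputs on a later run.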

Python: How to sum a column in a CSV file while skipping the header row

Trying to sum a column in a CSV file that has a header row at the top. I'm trying to use this for loop, but it just returns zero. Any thoughts?
import csv
CSVFile = open('Data103.csv')
CSVReader = csv.reader(CSVFile) #you don't pass a file name directly to csv.reader
CSVDataList = list(CSVReader) #stores the csv file as a list of lists
print(CSVDataList[0][16])
total = 0
for row in CSVReader:
    if CSVReader.line_num == 1:
        continue
        total += int(row[16])
print (total)
Here is what the data sample looks like in txt:
Value,Value,Value, "15,500.05", 00.00, 00.00
So the items are delimited by commas, except where a value itself contains a comma, in which case it is wrapped in double quotes. It's a pretty standard file with a header row and about 1k lines of data across 18 columns.
You might want to use Pandas.
import pandas as pd
df = pd.read_csv('/path/to/file.csv')
column_sum = df['column_name'].sum()
It seems that you've over-indented the line that does the sum. It should be like this:
for row in CSVReader:
    if CSVReader.line_num == 1:
        continue
    total += int(row[16])
Otherwise you'll only sum the values for the first row, which is exactly the one you want to skip.
EDIT:
Since you said the previous change doesn't work, I'd suggest working with the excellent Python lib called rows.
With the following CSV (fruits.csv):
id,name,amount
1,apple,3
2,banana,6
3,pineapple,2
4,lemon,5
You can access columns directly by their name instead of their index:
import rows
data = rows.import_from_csv('fruits.csv')
for fruit_data in data:
    print(fruit_data.name, fruit_data.amount)
# output:
# apple 3
# banana 6
# pineapple 2
# lemon 5
NEW EDIT:
After you've provided the data, I believe in your case you could do something like:
import rows
data = rows.import_from_csv('Data103.csv')
print(data.field_names[16]) # prints the field name
total = 0
for row in data:
    value = row.<column_name>  # replace <column_name> with the real field name
    value = value.replace(',', '') # remove commas
    total += float(value)
print (total)
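One more thing worth checking in the question's code: csv.reader is an iterator, so the list(CSVReader) call consumes every row. The for loop that follows then has nothing left to iterate over, which leaves total at 0. Here is a standard-library sketch that avoids this. It assumes values may carry thousands separators and quotes as in the sample line, and the function name is made up for illustration.

```python
import csv


def sum_column(path, column_index):
    """Sum one column of a CSV file, skipping the header row."""
    total = 0.0
    with open(path, newline="") as f:
        # skipinitialspace lets a quoted value that follows ", " parse correctly
        reader = csv.reader(f, skipinitialspace=True)
        next(reader, None)  # skip the header row
        for row in reader:
            # drop thousands separators such as the comma in "15,500.05"
            total += float(row[column_index].replace(",", ""))
    return total
```

For the file in the question, the call would be sum_column('Data103.csv', 16). Using float rather than int matters here, because int('15500.05') raises a ValueError.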

Convert and concatenate data from two columns of a csv file

I have a csv file which contains data in two columns, as follows:
40500 38921
43782 32768
55136 49651
63451 60669
50550 36700
61651 34321
and so on...
I want to convert each value to its hex equivalent, concatenate the two values in each row, and write the result into a column in another csv file.
For example: hex(40500) = 9E34, and hex(38921) = 9809.
So, in the output csv file, element A1 would be 9E349809.
So I am expecting column A in the output csv file to be:
9E349809
AB068000
D760C1F3
F7DBECFD
C5768F5C
F0D38611
I referred to some sample code that concatenates two columns, but I am struggling with converting the values to hex before concatenating them. This is the code:
import csv
inputFile = 'input.csv'
outputFile = 'output.csv'
with open(inputFile) as f:
    reader = csv.reader(f)
    with open(outputFile, 'w') as g:
        writer = csv.writer(g)
        for row in reader:
            new_row = [''.join([row[0], row[1]])] + row[2:]
            writer.writerow(new_row)
How can I convert the data in each column to its hex equivalent, then concatenate the values and write them to another file?
You could do this in 4 steps:
1. Read the lines from the input csv file.
2. Use formatting options to get the hex value of each number.
3. Perform string concatenation to get your result.
4. Write to the new csv file.
Sample Code:
with open(outputFile, 'w') as outfile:
    with open(inputFile, 'r') as infile:
        for line in infile:  # Iterate through each line
            left, right = int(line.split()[0]), int(line.split()[1])  # split left and right blocks
            # create new string using uppercase hex values excluding '0x', zero-padded to 4 digits (assumes 16-bit values)
            newstr = '{:04X}'.format(left) + '{:04X}'.format(right)
            outfile.write(newstr + '\n')  # write to output file, one value per line
        print('Conversion completed')
print('Closing outputfile')
Sample Output:
In[44] line = '40500 38921'
Out[50]: '9E349809'
ParvBanks' solution is good (clear and functional). I would simplify it a little, like this:
with open(inputFile, 'r') as infile, open(outputFile, 'w') as outfile:
    for line in infile:
        outfile.write("".join("{:04X}".format(int(v)) for v in line.split()) + "\n")
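The conversion itself can be pulled out into a small function that is easy to check against the values in the question. This is a sketch: it assumes both numbers fit in 16 bits (hence the default width of four hex digits) and that the input columns are separated by whitespace, as in the sample. The function names are made up for illustration.

```python
def to_hex_pair(left, right, width=4):
    """Format two integers as zero-padded uppercase hex and join them."""
    return f"{left:0{width}X}{right:0{width}X}"


def convert_file(src, dst):
    """Write one concatenated hex string per non-empty input line."""
    with open(src) as infile, open(dst, "w") as outfile:
        for line in infile:
            fields = line.split()
            if len(fields) < 2:
                continue  # skip blank or incomplete lines
            outfile.write(to_hex_pair(int(fields[0]), int(fields[1])) + "\n")
```

The zero padding keeps the two halves unambiguous when a value is below 0x1000. Without it, 255 and 1 would be written as FF1 instead of 00FF0001.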
