How to get the full text file after merge? - python-3.x

I’m merging two text files, file1.tbl and file2.tbl, on a common column. I used pandas to build a DataFrame from each file and the merge function to produce the output.
The problem is that the output file does not contain the whole data: there is a row of "..." instead, and at the end it just prints [9997 rows x 5 columns].
I need a file containing all 9997 rows.
import pandas

with open("file1.tbl") as file:
    d1 = file.read()
with open("file2.tbl") as file:
    d2 = file.read()

df1 = pandas.read_table('file1.tbl', delim_whitespace=True, names=('ID', 'chromosome', 'strand'))
df2 = pandas.read_table('file2.tbl', delim_whitespace=True, names=('ID', 'NUClen', 'GCpct'))

merged_table = pandas.merge(df1, df2)
with open('merged_table.tbl', 'w') as f:
    print(merged_table, file=f)
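
Printing a DataFrame writes its truncated string representation, which is exactly where the "..." row and the [9997 rows x 5 columns] footer come from: pandas abbreviates large frames for display. A minimal sketch of the usual fix, reusing the frames from the question: replace the print with DataFrame.to_csv, which always writes every row (the sep and index arguments here are choices, not requirements).

import pandas

df1 = pandas.read_table('file1.tbl', delim_whitespace=True, names=('ID', 'chromosome', 'strand'))
df2 = pandas.read_table('file2.tbl', delim_whitespace=True, names=('ID', 'NUClen', 'GCpct'))
merged_table = pandas.merge(df1, df2)

# to_csv writes all 9997 rows; index=False drops the integer index column,
# and sep='\t' keeps a tab-separated .tbl-style layout.
merged_table.to_csv('merged_table.tbl', sep='\t', index=False)

Alternatively, pandas.set_option('display.max_rows', None) makes print show everything, but to_csv is the idiomatic way to persist a frame to disk.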

Related

pandas read_csv with data and headers in alternate columns

I have a generated CSV file that:
- doesn't have headers, and
- has header and data occurring alternately in every row (the headers do not change from row to row).
E.g.:
imageId,0,feat1,30,feat2,34,feat,90
imageId,1,feat1,0,feat2,4,feat,89
imageId,2,feat1,3,feat2,3,feat,80
IMO, this format is redundant and cumbersome (I don't see why anyone would generate files in it). The saner, normal CSV of the same data, which I can read directly using pd.read_csv(), would be:
imageId,feat1,feat2,feat
0,30,34,90
1,0,4,89
2,3,3,80
My question is: how do I read the original data into a pandas DataFrame? For now, I do a read_csv and then drop all the alternate columns:
df = pd.read_csv(file, header=None)
df = df[range(1, len(df.columns), 2)]
The problem with this is that I don't get the headers unless I make it a point to specify them.
Is there a simpler way of telling pandas that the format has data and headers in every row?
Select the data columns by position with DataFrame.iloc, and set the new column names from the even-positioned values of the first row (this assumes the header columns hold the same values in every row, as in the sample data):
import pandas as pd

# default integer headers
df = pd.read_csv(file, header=None)
df1 = df.iloc[:, 1::2]
df1.columns = df.iloc[0, ::2].tolist()
print(df1)

   imageId  feat1  feat2  feat
0        0     30     34    90
1        1      0      4    89
2        2      3      3    80
I didn't measure it, but I would expect that reading the entire file (redundant headers plus actual data) before filtering out the interesting columns could be a problem. So I tried to exploit the optional parameters nrows and usecols to (hopefully) limit the amount of memory needed to process the CSV input file.
# --- Utilities for generating test data ---
import random as rd

def write_csv(file, line_count=100):
    with open(file, 'w') as f:
        r = lambda: rd.randrange(100)
        for i in range(line_count):
            line = f"imageId,{i},feat1,{r()},feat2,{r()},feat,{r()}\n"
            f.write(line)

file = 'text.csv'
# Generate a small CSV test file
write_csv(file, 10)

# --- Actual answer ---
import pandas as pd

# Read columns of the first row
dfi = pd.read_csv(file, header=None, nrows=1)
ncols = dfi.size

# Read data columns
dfd = pd.read_csv(file, header=None, usecols=range(1, ncols, 2))
dfd.columns = dfi.iloc[0, ::2].to_list()
print(dfd)

What is wrong with this pandas and txt file code?

I'm using pandas to open a CSV file that contains data from Spotify; meanwhile, I have a txt file that contains various artist names from that CSV file. What I'm trying to do is get the value from each row of the txt file and automatically search for it with the function I've written.
import pandas as pd
import time

df = pd.read_csv("data.csv")
df = df[['artists', 'name', 'year']]

def buscarA():
    start = time.time()
    newdf = (df.loc[df['artists'].str.contains(art)])
    stop = time.time()
    tempo = (stop - start)
    print(newdf)
    e = ('{:.2f}'.format(tempo))
    print(e)

with open("teste3.txt", "r") as f:
    for row in f:
        art = row
        buscarA()
but the output is always the same:
Empty DataFrame
Columns: [artists, name, year]
Index: []
The problem here is that when you read the lines of your file in Python, each line keeps its trailing line break, so you have to strip it off.
Let's suppose that the first line of your teste3.txt file is "James Brown". It'd be read as "James Brown\n" and not recognized by the search.
Changing the last chunk of your code to:
with open("teste3.txt", "r") as f:
for row in f:
art = row.strip()
buscarA()
should work.
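
One further caveat, beyond the newline fix above: str.contains interprets its argument as a regular expression and propagates missing values, so an artist name containing characters such as ( or + would raise an error, and NaN entries in the artists column would break the boolean mask. A defensive variant of the lookup line inside buscarA (the flags are optional suggestions, not part of the original answer):

# Literal, case-insensitive match; missing artists count as non-matches.
newdf = df.loc[df['artists'].str.contains(art, regex=False, case=False, na=False)]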

Converting texts into a CSV file under separate columns

I have the following in a .txt file:
1.['LG','Samsung','Asus','HP','Apple','HTC']
2.['covid','vaccine','infection','cure','chloroquine']
3.['p2p','crypto','bitcoin','litecoin','blockchain']
How do I convert the above into a CSV file under different columns?
My current code is this:

import csv

with open('Full_txt_results.txt', 'r') as in_file:
    stripped = (line.strip() for line in in_file)
    lines = (line.split(",") for line in stripped if line)
    with open('textlabels.csv', 'w') as out_file:
        writer = csv.writer(out_file)
        writer.writerows(lines)
The code currently gives the result in the following format in the CSV:

Column 1   Column 2    Column 3     Column 4    Column 5       Column 6
['LG'      'Samsung'   'Asus'       'HP'        'Apple'        'HTC']
['covid'   'vaccine'   'infection'  'cure'      'chloroquine']
['p2p'     'crypto'    'bitcoin'    'litecoin'  'blockchain']

The texts are spilled across different columns, one list per row.
The ideal output required is in the format below:

Column 1   Column 2      Column 3
LG         covid         p2p
Samsung    vaccine       crypto
Asus       infection     bitcoin
HP         cure          litecoin
Apple      chloroquine   blockchain
HTC
Use the ast module to convert each string to a list object, then write the rows out with csv.writer's writerow method.
Ex:
import csv
import ast

with open('Full_txt_results.txt') as in_file, open('textlabels.csv', 'w', newline="") as out_file:
    writer = csv.writer(out_file)
    # Strip the leading row numbers ("1.", "2.", ...) before parsing each list.
    # If your file has no such prefixes, use ast.literal_eval(line.strip()) instead.
    data = [ast.literal_eval(line.strip().split(".", 1)[1]) for line in in_file]
    for row in zip(*data):
        writer.writerow(row)
Demo:

import csv
import ast

with open(filename) as in_file, open(outfile, 'w', newline="") as out_file:
    writer = csv.writer(out_file)
    data = [ast.literal_eval(line.strip()) for line in in_file]
    for row in zip(*data):
        writer.writerow(row)
Source txt file:
['LG','Samsung','Asus','HP','Apple','HTC']
['covid','vaccine','infection','cure','chloroquine']
['p2p','crypto','bitcoin','litecoin','blockchain']
Output:
LG,covid,p2p
Samsung,vaccine,crypto
Asus,infection,bitcoin
HP,cure,litecoin
Apple,chloroquine,blockchain
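
One caveat with the demo: zip stops at the shortest list, which is why HTC from the first row is missing from the output even though the ideal layout in the question keeps it with blank cells beneath. A sketch of a variant, assuming the same prefix-free source file as the demo, that pads the shorter columns via itertools.zip_longest:

import csv
import ast
from itertools import zip_longest

with open(filename) as in_file, open(outfile, 'w', newline="") as out_file:
    writer = csv.writer(out_file)
    data = [ast.literal_eval(line.strip()) for line in in_file]
    # zip_longest runs until the longest list is exhausted,
    # filling missing cells with fillvalue instead of truncating.
    for row in zip_longest(*data, fillvalue=""):
        writer.writerow(row)

With the same source file, this appends a sixth line, HTC,, so no item is lost.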

Create a DataFrame from lists

I want to create a DataFrame from existing lists (each row of the file will be written into a row of the DataFrame).
with open(filename, mode='r', encoding='cp1252') as f:
    lines = f.readlines()
    liste1 = str(lines[0])
    df1 = pd.DataFrame(liste1)
Can anyone help me, please?
Below are the first 3 rows of file f1:
['x1', 'major', '1198', 'TCP']
['x1', 'minor', '1198', 'UDP']
['x2', 'major', '1198', 'UDP']
If I understand this properly, you want each row in the DataFrame to be a string read from a line of the file?
Note that liste1 in your case is a string, so I am not sure what you are going for.
This approach should work anyway.
import pandas as pd

df1 = pd.DataFrame()
with open(filename, mode='r', encoding='cp1252') as f:
    lines = f.readlines()
    liste1 = str(lines[0])
    df1 = df1.append(pd.Series(liste1), ignore_index=True)
So if liste1 has the form
> "This is a string"
then your DataFrame will look like this:

df1.head()
                  0
0  This is a string

If liste1 has the form
> ["This", "is", "a", "list"]
then your DataFrame will look like this:

df1.head()
      0   1  2     3
0  This  is  a  list
You can then call this append() routine as many times as you want inside a loop.
However, I suspect that there is a function, such as pd.read_table(), that can do this all for you automatically (as #jezrael suggested in the comments to your question).
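
For what it's worth, a sketch of that more direct route, assuming the file really contains one Python-list-style line per row with straight quotes, as shown above. Here, ast.literal_eval is my suggestion rather than something from the thread; note also that DataFrame.append was removed in pandas 2.0, so building the frame in one go future-proofs the code:

import ast
import pandas as pd

with open(filename, mode='r', encoding='cp1252') as f:
    # Each line like ['x1', 'major', '1198', 'TCP'] becomes a real Python list.
    rows = [ast.literal_eval(line.strip()) for line in f if line.strip()]

# One parsed line per DataFrame row, one list element per column.
df1 = pd.DataFrame(rows)
print(df1.head())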

How do I remove the first column in a CSV file?

I have a CSV file where the first row of the first column is blank, with some numbers in the second and third rows. This whole column is useless, and I need to remove it so I can convert the data into a JSON file. I just need to know how to remove the first column of data so I can parse it. Any help is greatly appreciated!
My script is as follows:

#!/usr/bin/python3
import pandas as pd
import csv, json

xls = pd.ExcelFile(r'C:\Users\Andy-\Desktop\Lab2Data.xlsx')
df = xls.parse(sheetname="Sheet1", index_col=None, na_values=['NA'])
df.to_csv('file.csv')

file = open('file.csv', 'r')
lines = file.readlines()
file.close()

data = {}
with open('file.csv') as csvFile:
    csvReader = csv.DictReader(csvFile)
    for rows in csvReader:
        id = rows['Id']
        data[id] = rows

with open('Lab2.json', 'w') as jsonFile:
    jsonFile.write(json.dumps(data, indent=4))
I don't know much about JSON files, but this will remove the first column from your CSV file:
with open('new_file.csv', 'w') as out_file:
    with open('file.csv') as in_file:
        for line in in_file:
            test_string = line.strip('\n').split(',')
            out_file.write(','.join(test_string[1:]) + '\n')
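
A guess about the root cause, separate from the answer above: a blank header over the first column is exactly what DataFrame.to_csv produces when it writes the row index, so the column probably never needs stripping at all. Passing index=False when the CSV is created keeps it out from the start:

df.to_csv('file.csv', index=False)  # don't write the row index as an unnamed first column

csv.DictReader then sees only the real data columns, and the rows['Id'] lookup works unchanged.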
