Using pd.read_table() multiple times on same open file - python-3.x

I have a data structure of the following form:
**********DATA:0************
name_A name_B
0.16561919 0.03640960
0.39564838 0.66708115
0.60828075 0.95785214
0.68716186 0.92803331
0.80615505 0.96219926
**********data:0************
**********DATA:1************
name_A name_B
0.32474381 0.82506909
0.30934914 0.60406956
0.99519513 0.23425607
0.72210821 0.61141751
0.47362605 0.09892009
**********data:1************
**********DATA:2************
name_A name_B
0.46561919 0.13640960
0.29564838 0.66708115
0.40828075 0.35785214
0.08716186 0.52803331
0.70615505 0.96219926
**********data:2************
I would like to read each block into a separate pandas DataFrame with the appropriate header titles. When I use the simple function below, only a single data block is stored in the output list. However, when I comment out the data.append(pd.read_table(file, nrows=5)) line, the function prints all of the individual headers. The pandas read_table call seems to break out of the loop.
import pandas as pd

def read_data(filename):
    data = []
    with open(filename) as file:
        for line in file:
            if "**********DATA:" in line:
                print(line)
                data.append(pd.read_table(file, nrows=5))
    return data

read_data("data_file.txt")
How should I change the function to read all blocks?

I suggest a slightly different approach, in which you avoid using read_table and put dataframes in a dict instead of a list, like this:
import pandas as pd

def read_data(filename):
    data = {}
    i = 0
    with open(filename) as file:
        for line in file:
            if "**********DATA:" in line:
                data[i] = []
                continue
            if "**********data:" in line:
                i += 1
                data[i] = []
                continue
            else:
                data[i].append(line.strip("\n").split(" "))
    return {
        f"data_{k}": pd.DataFrame(data=v[1:], columns=v[0])
        for k, v in data.items()
        if v
    }
And so, with the text file you gave as input:
dfs = read_data("data_file.txt")
print(dfs["data_0"])
# Output
name_A name_B
0 0.16561919 0.03640960
1 0.39564838 0.66708115
2 0.60828075 0.95785214
3 0.68716186 0.92803331
4 0.80615505 0.96219926
print(dfs["data_1"])
# Output
name_A name_B
0 0.32474381 0.82506909
1 0.30934914 0.60406956
2 0.99519513 0.23425607
3 0.72210821 0.61141751
4 0.47362605 0.09892009
print(dfs["data_2"])
# Output
name_A name_B
0 0.46561919 0.13640960
1 0.29564838 0.66708115
2 0.40828075 0.35785214
3 0.08716186 0.52803331
4 0.70615505 0.96219926
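As an aside, the original loop most likely stops because pandas reads ahead in the open file handle, so the first read_table call leaves nothing for the loop's iterator to consume. If you would rather keep pandas doing the parsing, one possible workaround (a sketch, not the answer above; read_blocks is a hypothetical name) is to copy each block into an in-memory buffer first:

```python
import io
import pandas as pd

def read_blocks(filename):
    """Collect each DATA block into a StringIO buffer, then parse it with pandas."""
    frames = []
    block = None
    with open(filename) as fh:
        for line in fh:
            if line.startswith("**********DATA:"):
                block = io.StringIO()          # start collecting a new block
            elif line.startswith("**********data:"):
                block.seek(0)                  # rewind before parsing
                frames.append(pd.read_csv(block, sep=r"\s+"))
                block = None
            elif block is not None:
                block.write(line)              # header or data row inside a block
    return frames
```

This way pandas only ever sees one block at a time and the outer loop keeps full control of the file handle.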

Related

How do I load a csv from a class/def function, and how do I then make it read/print both rows and columns?

I have a utils.py and a main.py. In the utils.py file I want all my data loading, formula defs and so on. I want to create a class Data_load() that handles loading the data, so I can pull it directly from main.py.
I have this:
utils.py:
def readMyFile(filename):
    file = []
    with open(filename) as csvDataFile:
        csvReader = csv.reader(csvDataFile, delimiter=';')
        for row in csvReader:
            file.append(row[0])
    return file

file = readMyFile('C:\\...\\count_all_terminate.csv')
file_load = pd.DataFrame(file)
Got this:
main.py reads (one column only and with no header??!!):
0
0 User Name
1 146166
2 146166
3 146166
4 146166
... ...
3987 200589
3988 194018
3989 194449
3990 174565
3991 175440
I wanted main.py to read this:
0 col 2 col 3 col n
0 User Name
1 146166
2 146166
3 146166
4 146166
... ...
3987 200589
3988 194018
3989 194449
3990 174565
3991 175440
How do I place the def in a class? Something like the following...
class Data_load():
    def __init__(self, ....):
        self
    def readMyFile(filename):
        file = []
        with open(filename) as csvDataFile:
            csvReader = csv.reader(csvDataFile, delimiter=';')
            for row in csvReader:
                file.append(row[0])
        return file
..and how do I make it print all the columns I know exist in the 'count_all_terminate.csv' file? Any help is appreciated. Happy New Year from Hubsandspokes
If you want to use a class, I would employ pandas. You could do it standalone as follows:
import pandas as pd

df = pd.read_csv(filepath, **kargs)  # see docs for **kargs definitions
df_stats = df.describe()  # provides summary statistics on each column; see docs for description
If you desire a class:
class Data_load:
    def __init__(self, filepath, **kargs):
        self.df = pd.read_csv(filepath, **kargs)

    def summary_stats(self):
        return self.df.describe()
Then to use:
filepath = r"path to csv file of interest"
myData = Data_load(filepath)
EDA = myData.summary_stats()
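On the original complaint (one column only, no header): the question's readMyFile keeps only row[0] from each row, so every other column is discarded. Letting pandas parse the whole file with the right delimiter returns every column. A minimal sketch, assuming the real file is semicolon-separated with a header row (the demo file name here is hypothetical):

```python
import pandas as pd

# Hypothetical stand-in for count_all_terminate.csv; the real file uses ';' as delimiter
with open("terminate_demo.csv", "w") as f:
    f.write("User Name;Dept;Count\n146166;HR;3\n200589;IT;7\n")

df = pd.read_csv("terminate_demo.csv", sep=";")  # pandas parses every column
print(list(df.columns))   # all header names, not just the first
print(df.shape)
```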

How to prepend a string to a column in csv

I have a csv file, in.csv, with 5 columns, and I want to prepend all the data in column 2 with "text" in this same file, such that the row
data1 data2 data3 data4 data5
becomes
data1 textdata2 data3 data4 data5
I thought using regex might be a good idea, but I'm not sure how to proceed
Edit:
After proceeding according to bigbounty's answer, I used the following script:
import pandas as pd
df = pd.read_csv("in.csv")
df["id_str"] = str("text" + str(df["id_str"]))
df.to_csv("new_in.csv", index=False)
My in.csv file is like:
s_no,id_str,screen_name
1,1.15017060743203E+018,lorem
2,1.15015544419693E+018,ipsum
3,1.15015089995785E+018,dolor
4,1.15015054311063E+018,sit
After running the script, the new_in.csv file is:
s_no,id_str,screen_name
1,"text0 1.150171e+18
1 1.150155e+18
2 1.150151e+18
3 1.150151e+18
Name: id_str, dtype: float64",lorem
2,"text0 1.150171e+18
1 1.150155e+18
2 1.150151e+18
3 1.150151e+18
Name: id_str, dtype: float64",ipsum
3,"text0 1.150171e+18
1 1.150155e+18
2 1.150151e+18
3 1.150151e+18
Name: id_str, dtype: float64",dolor
4,"text0 1.150171e+18
1 1.150155e+18
2 1.150151e+18
3 1.150151e+18
Name: id_str, dtype: float64",sit
Whereas it should be:
s_no,id_str,screen_name
1,text1.15017060743203E+018,lorem
2,text1.15015544419693E+018,ipsum
3,text1.15015089995785E+018,dolor
4,text1.15015054311063E+018,sit
This can easily be done using pandas.
import pandas as pd
df = pd.read_csv("in.csv")
df["data2"] = "text" + df["data2"].astype(str)
df.to_csv("new_in.csv", index=False)
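On the edited question's failure mode: str("text" + str(df["id_str"])) stringifies the entire Series at once, which is why every cell received the full column dump; and because pandas parsed id_str as float, the IDs were mangled into scientific notation. A sketch of both fixes, using an in-memory copy of the sample data:

```python
import io
import pandas as pd

csv_text = (
    "s_no,id_str,screen_name\n"
    "1,1.15017060743203E+018,lorem\n"
    "2,1.15015544419693E+018,ipsum\n"
)
# dtype=str keeps the IDs exactly as written instead of parsing them as floats
df = pd.read_csv(io.StringIO(csv_text), dtype={"id_str": str})
df["id_str"] = "text" + df["id_str"]  # element-wise, not whole-Series, concatenation
print(df["id_str"].tolist())
# ['text1.15017060743203E+018', 'text1.15015544419693E+018']
```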
Using the csv module
import csv

with open('test.csv', 'r+', newline='') as f:
    data = list(csv.reader(f))  # produces a list of lists
    for i, r in enumerate(data):
        if i > 0:  # presumes the first list is a header and skips it
            r[1] = 'text' + r[1]  # add text to the front of the text at index 1
    f.seek(0)  # find the beginning of the file
    writer = csv.writer(f)
    writer.writerows(data)  # write the new data back to the file
# the resulting text file
s_no,id_str,screen_name
1,text1.15017060743203E+018,lorem
2,text1.15015544419693E+018,ipsum
3,text1.15015089995785E+018,dolor
4,text1.15015054311063E+018,sit
Using pandas
This solution is agnostic of any column names because it uses column index.
pandas.DataFrame.iloc
import pandas as pd
# read the file set the column at index 1 as str
df = pd.read_csv('test.csv', dtype={1: str})
# add text to the column at index 1
df.iloc[:, 1] = 'text' + df.iloc[:, 1]
# write to csv
df.to_csv('test.csv', index=False)
# resulting csv
s_no,id_str,screen_name
1,text1.15017060743203E+018,lorem
2,text1.15015544419693E+018,ipsum
3,text1.15015089995785E+018,dolor
4,text1.15015054311063E+018,sit

Creating two columns from an unstructured file of IDs and sequences

Problem: Working with Python 3.x, I have a file called input.txt with content as below:
2345673 # First ID
0100121102020211111002 # first sequence (seq) which is long and goes to several lines
0120102100211001101200
6758442 #Second ID
0202111100011111022222 #second sequence (seq) which is long and goes to several lines
0202111110001120211210
0102101011211001101200
What I want: to process input.txt and save the results in output.csv so that when I read it in pandas, the result is a data frame like below.
ID Seq
2345673 0 1 0 0 1 2 1 1 0 2 …
6758442 0 2 0 2 1 1 1 1 0 0 …
Below is my code
with open("input.txt") as f:
    with open("out.csv", "w") as f1:
        for i, line in enumerate(f):  # read each line in file
            if(len(line) < 15):  # check if line length is, say, < 15
                id = line  # if yes, make line ID
            else:
                seq = line  # if not, make it a sequence
                #print(id)
                lines = []
                lines.append(','.join([str(id), str(seq)]))
                for l in lines:
                    f1.write('(' + l + '),\n')  # write to file f1
When I read out.csv in pandas, the output is not what I want (see below). I will appreciate your help; I am really stuck.
(2345673
,0100121102020211111002
),
(2345673
,0120102100211001101200
),
(6758442
,0202111100011111022222
),
(6758442
,0202111110001120211210
),
(6758442
,0102101011211001101200),
import pandas as pd

### idea is to create two lists: one with ids and another with sequences
with open("input.txt") as f:
    ids = []
    seqs = []
    seq = ""
    for i, line in enumerate(f):
        if (len(line) < 15):
            seqs.append(seq)
            id = line
            id = id.rstrip('\n')
            id = id.rstrip(' ')
            ids.append(id)
            seq = ""
        else:
            # next three lines combine all sequences that correspond to the same id into one
            additional_seq = line.rstrip('\n')
            additional_seq = additional_seq.rstrip(' ')
            seq += additional_seq
    seqs.append(seq)
    seqs = seqs[1:]

df = pd.DataFrame(list(zip(ids, seqs)), columns=['id', 'seq'])
df.to_csv("out.csv", index=False)
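The same pairing logic can also be written without the placeholder-then-slice step by starting a fresh record whenever a short ID line appears. This is a sketch under the same assumptions (ID lines shorter than 15 characters; the # comments shown in the question are not in the real file; parse_id_seq is a hypothetical name):

```python
import pandas as pd

def parse_id_seq(path, id_max_len=15):
    """Pair each short ID line with the concatenated sequence lines that follow it."""
    records = []
    with open(path) as fh:
        for raw in fh:
            line = raw.strip()
            if not line:
                continue
            if len(line) < id_max_len:   # short line starts a new record
                records.append([line, ""])
            elif records:                # long line extends the current sequence
                records[-1][1] += line
    return pd.DataFrame(records, columns=["id", "seq"])
```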

Create file with reference to a file filtered based on common information from other file in python

I have 2 text files. file1 has 6 columns and 2 rows but file2 has 2 columns and 5 rows like these examples:
file1:
Code S1 S2 S3 S4 S5
X2019060656_12 4.068522 1889.299282 1547.771971 434.392935 4346.019078
X2019060657_05 1.318325 1290.142988 285.579601 73.329331 2222.198520
file2:
Class group
X2019060656_12 A
X2019060657_05 A
X2019060658_04 A
X2019060659_03 A
X2019060660_08 A
I would like to make a subset of file2 filtered based on the intersection of the column "Class" of file2 and the column "Code" in file1.
This is the expected output:
Class group
X2019060656_12 A
X2019060657_05 A
To do so, I made the following code in python:
file1 = open("file1.txt", "r")
file2 = open("file2.txt", "r")

file1 = {}
keys1 = []
values1 = []
with open("file1.txt") as file1:
    for line in file1.lines():
        keys1.append(line[0])
        values1.append(line[1])
dict_file1 = dict(zip(keys1, values1))

file2 = {}
keys2 = []
values2 = []
with open("file2.txt") as file2:
    for line in file2.lines():
        keys2.append(line[0])
        values2.append(line[1])
dict_file2 = dict(zip(keys2, values2))

newlist = []
for item in dict_file1:
    for item2 in dict_file2:
        if item1 == item2:
            new_list.append(line)

with open('new_file.txt', 'w') as f:
    for i in new_list:
        f.write("%s\n" % i)
but the output file is not like the expected output. Do you know how to fix it?
You can do this with pandas like this:
import pandas as pd
df1 = pd.read_csv("file1.txt",delim_whitespace=True)
df2 = pd.read_csv("file2.txt",delim_whitespace=True)
df2[df2['Class'].isin(df1['Code'])]
Output:
Class group
0 X2019060656_12 A
1 X2019060657_05 A
If you want to export to file, use df2.to_csv
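For completeness, the same filter can be done without pandas, closer in spirit to the question's dictionary attempt. A sketch assuming both files are whitespace-separated with one header line (filter_by_codes is a hypothetical name):

```python
def filter_by_codes(file1_path, file2_path, out_path):
    """Keep only the file2 rows whose first column appears in file1's Code column."""
    with open(file1_path) as f1:
        codes = {line.split()[0] for line in f1.readlines()[1:]}  # skip header
    with open(file2_path) as f2, open(out_path, "w") as out:
        lines = f2.readlines()
        out.write(lines[0])  # keep the header row
        out.writelines(l for l in lines[1:] if l.split()[0] in codes)
```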

how to remove the quotations from the string in python?

The csv file returns the column value in dictionary format, but I can't get the value from the dictionary by using dic.get("name"); it shows an error like ['str' object has no attribute 'get']. The actual problem is that the csv returns the dict with quotes, so Python considers it a string. How do I remove the quotes and fix this? Please help!!
with open('file.csv') as file:
    reader = csv.reader(file)
    count = 0
    for idx, row in enumerate(reader):
        dic = row[5]
        if(idx == 0):
            continue
        else:
            print(dic.get("name"))
filename file_size file_attributes region_count region_id region_shape_attributes region_attributes
adutta_swan.jpg -1 {"caption":"Swan in lake Geneve","public_domain":"no","image_url":"http://www.robots.ox.ac.uk/~vgg/software/via/images/swan.jpg"} 1 0 {"name":"rect","x":82,"y":105,"width":356,"height":207} {"name":"not_defined","type":"unknown","image_quality":{"good":true,"frontal":true,"good_illumination":true}}
wikimedia_death_of_socrates.jpg -1 {"caption":"The Death of Socrates by David","public_domain":"yes","image_url":"https://en.wikipedia.org/wiki/The_Death_of_Socrates#/media/File:David_-_The_Death_of_Socrates.jpg"} 3 0 {"name":"rect","x":174,"y":139,"width":108,"height":227} {"name":"Plato","type":"human","image_quality":{"good_illumination":true}}
wikimedia_death_of_socrates.jpg -1 {"caption":"The Death of Socrates by David","public_domain":"yes","image_url":"https://en.wikipedia.org/wiki/The_Death_of_Socrates#/media/File:David_-_The_Death_of_Socrates.jpg"} 3 1 {"name":"rect","x":347,"y":114,"width":91,"height":209} {"name":"Socrates","type":"human","image_quality":{"frontal":true,"good_illumination":true}}
wikimedia_death_of_socrates.jpg -1 {"caption":"The Death of Socrates by David","public_domain":"yes","image_url":"https://en.wikipedia.org/wiki/The_Death_of_Socrates#/media/File:David_-_The_Death_of_Socrates.jpg"} 3 2 {"name":"ellipse","cx":316,"cy":180,"rx":17,"ry":12} {"name":"Hemlock","type":"cup"}
Use DictReader which reads the csv as a dictionary!
import csv
import json

with open('graph.csv') as file:
    # Read csv as dictreader
    reader = csv.DictReader(file)
    count = 0
    # Iterate through rows
    for idx, row in enumerate(reader):
        # Load the string as a dictionary
        region_shape_attributes = json.loads(row['region_shape_attributes'])
        print(region_shape_attributes['name'])
import csv
import ast

with open('file.csv') as file:
    # Read csv as dictreader
    reader = csv.DictReader(file)
    count = 0
    # Iterate through rows
    for idx, row in enumerate(reader):
        #print(row)
        row_5 = row['region_shape_attributes']
        y = ast.literal_eval(row_5)
        print(y.get("name"))
This code also works for me.
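One caveat on the ast.literal_eval variant: it parses Python literals, so cells containing JSON booleans (for example the "good":true entries in the region_attributes column) will raise a ValueError, while json.loads handles them. A quick sketch of the difference:

```python
import ast
import json

cell = '{"name":"rect","x":82,"y":105}'
print(json.loads(cell)["name"])        # rect
print(ast.literal_eval(cell)["name"])  # rect -- plain dicts parse either way

bool_cell = '{"good":true}'
print(json.loads(bool_cell)["good"])   # True
try:
    ast.literal_eval(bool_cell)        # 'true' is not a Python literal
except ValueError:
    print("ast.literal_eval rejects JSON booleans")
```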