Create a dataframe from lists - python-3.x

I want to create a DataFrame from existing lists (each row of the file will be written to a row of the DataFrame).
import pandas as pd

with open(filename, mode='r', encoding='cp1252') as f:
    lines = f.readlines()
liste1 = str(lines[0])
df1 = pd.DataFrame(liste1)
Who can help me, please?
Below are the first 3 rows of file f1.
['x1', 'major', '1198', 'TCP']
['x1', 'minor', '1198', 'UDP']
['x2', 'major', '1198', 'UDP']

If I understand this properly, you want each row in the DataFrame to be a string read from a line in the file?
Note that liste1 in your case is a string, so I am not sure what you are going for.
This approach should work anyway:
import pandas as pd

df1 = pd.DataFrame()
with open(filename, mode='r', encoding='cp1252') as f:
    lines = f.readlines()
liste1 = str(lines[0])
df1 = df1.append(pd.Series(liste1), ignore_index=True)
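(Side note: DataFrame.append was removed in pandas 2.0; on current pandas, a minimal equivalent of that last line is:)
# pd.concat replaces the removed DataFrame.append
df1 = pd.concat([df1, pd.Series(liste1).to_frame().T], ignore_index=True)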
So if liste1 has the form
> "This is a string"
then your DataFrame will look like this:
df1.head()
                  0
0  This is a string
If liste1 instead has the form
> ["This", "is", "a", "list"]
then your DataFrame will look like this:
df1.head()
      0   1  2     3
0  This  is  a  list
You can then call this append() routine as many times as you want inside a loop.
However, I suspect that there is a function, such as pd.read_table(), that can do all of this for you automatically (as @jezrael suggested in the comments to your question).
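Since each line of the file looks like a Python list literal, another option is to parse the lines yourself and hand pd.DataFrame a list of lists, so every value lands in its own column. A minimal sketch, assuming every line has the shape of the three sample rows:
import ast

import pandas as pd

with open(filename, mode='r', encoding='cp1252') as f:
    # turn "['x1', 'major', '1198', 'TCP']" into a real list per line
    rows = [ast.literal_eval(line.strip()) for line in f if line.strip()]

df1 = pd.DataFrame(rows)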

Related

pandas read_csv with data and headers in alternate columns

I have a generated CSV file that has no header row; instead, header names and data values occur alternately in every row (the header names do not change from row to row).
E.g.:
imageId,0,feat1,30,feat2,34,feat,90
imageId,1,feat1,0,feat2,4,feat,89
imageId,2,feat1,3,feat2,3,feat,80
IMO, this format is redundant and cumbersome (I don't see why anyone would generate files in this format). The saner/normal CSV of the same data, which I could read directly using pd.read_csv(), would be:
imageId,feat1,feat2,feat
0,30,34,90
1,0,4,89
2,3,3,80
My question is: how do I read the original data into a pandas DataFrame? For now, I do a read_csv and then drop all the alternate columns:
df = pd.read_csv(file, header=None)
df = df[range(1, len(df.columns), 2)]
The problem with this is that I don't get the headers unless I explicitly specify them.
Is there a simpler way of telling pandas that the format has data and headers in every row?
Select the data columns by position with DataFrame.iloc, and set the new column names from the even-positioned values of the first row (assuming the header columns hold the same values in every row, as in the sample data):
# default headers
df = pd.read_csv(file, header=None)
df1 = df.iloc[:, 1::2]
df1.columns = df.iloc[0, ::2].tolist()
print(df1)

   imageId  feat1  feat2  feat
0        0     30     34    90
1        1      0      4    89
2        2      3      3    80
I didn't measure, but I would expect it to be a problem to read the entire file (redundant headers plus actual data) before filtering out the interesting stuff. So I tried to exploit the optional parameters nrows and usecols to (hopefully) limit the amount of memory needed to process the CSV input file.
# --- Utilities for generating test data ---
import random as rd

def write_csv(file, line_count=100):
    with open(file, 'w') as f:
        r = lambda: rd.randrange(100)
        for i in range(line_count):
            line = f"imageId,{i},feat1,{r()},feat2,{r()},feat,{r()}\n"
            f.write(line)

file = 'text.csv'
# Generate a small CSV test file
write_csv(file, 10)

# --- Actual answer ---
import pandas as pd

# Read the columns of the first row only
dfi = pd.read_csv(file, header=None, nrows=1)
ncols = dfi.size

# Read only the data columns (the odd-numbered positions)
dfd = pd.read_csv(file, header=None, usecols=range(1, ncols, 2))
dfd.columns = dfi.iloc[0, ::2].to_list()
print(dfd)

Convert a pandas dataframe to tab separated list in Python

I have a dataframe like below:
import pandas as pd

data = {'Words': ['actually', 'he', 'came', 'from', 'home', 'and', 'played'],
        'Col2': ['2', '0', '0', '0', '1', '0', '3']}
data = pd.DataFrame(data)
The dataframe looks like this:
      Words Col2
0  actually    2
1        he    0
2      came    0
3      from    0
4      home    1
5       and    0
6    played    3
I write this dataframe to disk using the command below:
import numpy as np

np.savetxt('/folder/file.txt', data.values, fmt='%s', delimiter='\t')
And the next script reads it with this line of code:
data = load_file('/folder/file.txt')
Below is the load_file function that reads a text file:
def load_file(filename):
    with open(filename, 'r', encoding='utf-8') as f:
        data = f.readlines()
    return data
The data will be a tab-separated list.
print(data)
gives me the following output:
['actually\t2\n', 'he\t0\n', 'came\t0\n', 'from\t0\n', 'home\t1\n', 'and\t0\n', 'played\t3\n']
I don't want to write the file to disk and then read it back for processing. Instead, I want to convert the dataframe to a tab-separated list and process it directly. How can I achieve this?
I checked for existing answers, but most just convert a list to a dataframe, not the other way around.
Thanks in advance.
Try using .to_csv():
df_list = data.to_csv(header=None, index=False, sep='\t').splitlines()
df_list:
['actually\t2',
'he\t0',
'came\t0',
'from\t0',
'home\t1',
'and\t0',
'played\t3'
]
And if you want to keep the trailing newline characters, build them back in:
v = [line + '\n' for line in data.to_csv(header=None, index=False, sep='\t').splitlines()]
v:
['actually\t2\n',
'he\t0\n',
'came\t0\n',
'from\t0\n',
'home\t1\n',
'and\t0\n',
'played\t3\n'
]
I think this achieves the same result without writing to disk:
df_list = list(data.apply(lambda row: row['Words'] + '\t' + row['Col2'] + '\n', axis=1))
Try:
data.apply("\t".join, axis=1).tolist()
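Note that "\t".join works here only because both columns already hold strings. If some columns were numeric, casting first would be needed; a minimal sketch:
# cast every column to str so the row-wise join cannot fail on numbers
data.astype(str).apply("\t".join, axis=1).tolist()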

How to get the full text file after merge?

I'm merging two text files, file1.tbl and file2.tbl, on a common column. I used pandas to make a DataFrame from each and the merge function to produce the output.
The problem is that the output file does not contain the whole data: there is a row of "..." instead, and at the end it just prints [9997 rows x 5 columns].
I need a file containing all 9997 rows.
import pandas

with open("file1.tbl") as file:
    d1 = file.read()
with open("file2.tbl") as file:
    d2 = file.read()

df1 = pandas.read_table('file1.tbl', delim_whitespace=True, names=('ID', 'chromosome', 'strand'))
df2 = pandas.read_table('file2.tbl', delim_whitespace=True, names=('ID', 'NUClen', 'GCpct'))

merged_table = pandas.merge(df1, df2)
with open('merged_table.tbl', 'w') as f:
    print(merged_table, file=f)
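The truncation is a display issue, not a data issue: print() renders the DataFrame through pandas' display options, which elide middle rows by default. Writing the merged result with DataFrame.to_csv bypasses the display layer and emits every row. A minimal sketch (the tab separator is an assumption; use whatever your .tbl format expects):
# writes all 9997 rows; no display truncation is involved
merged_table.to_csv('merged_table.tbl', sep='\t', index=False)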

Python pandas read_csv for specific records in columns

I am trying to import data from a large CSV file (15GB+). I have to select a few columns with specific values (there are over 50 columns). As an example, I have used:
df = pd.read_csv('filename.csv', nrows=10000, usecols=['ID', 'State'])
Is there a way I can specify something like this?
df = pd.read_csv('filename.csv', nrows=10000, usecols=['ID', 'State'='abc'])
I can't find any option to do that.
There's no option to filter rows like that while reading CSV files.
What you can do is create an iterator, apply your filter to each chunk, and then concat the chunks. It would look something like this:
iterable = pd.read_csv('filename.csv', usecols=['ID', 'State'], iterator=True, chunksize=10000)
df = pd.concat([chunk[chunk['State'] == 'abc'] for chunk in iterable])
Assuming that the DataFrame resulting from the selection 'State' == 'abc' is small enough to fit in RAM, you could extract those rows from the CSV as follows; df is the resultant DataFrame.
import pandas as pd

inPath = 'filename.csv'
chunkSize = 10000  # a workable chunk size depends on your available memory

tmpDf = pd.read_csv(inPath, chunksize=chunkSize,
                    usecols=['ID', 'State'])
for chunk in tmpDf:
    try:
        df  # does df exist yet?
    except NameError:
        df = chunk[chunk['State'] == 'abc']
    else:
        df = pd.concat([df, chunk[chunk['State'] == 'abc']])
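A variant of the same idea avoids the try/except and the repeated pd.concat calls (each of which copies the accumulated frame): collect the filtered chunks in a list and concatenate once at the end. A minimal sketch under the same assumptions:
import pandas as pd

filtered = []
for chunk in pd.read_csv('filename.csv', usecols=['ID', 'State'], chunksize=10000):
    # keep only the matching rows of each chunk
    filtered.append(chunk[chunk['State'] == 'abc'])
df = pd.concat(filtered, ignore_index=True)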

Creating multiple dataframes with a loop

This undoubtedly reflects a lack of knowledge on my part, but I can't find anything online to help. I am very new to programming. I want to load six CSV files and do a few things to them before combining them later. The following code iterates over each file but only creates one dataframe, called df.
import pandas as pd

files = ('data1.csv', 'data2.csv', 'data3.csv', 'data4.csv', 'data5.csv', 'data6.csv')
dfs = ('df1', 'df2', 'df3', 'df4', 'df5', 'df6')
for df, file in zip(dfs, files):
    df = pd.read_csv(file)
    print(df.shape)
    print(df.dtypes)
    print(list(df))
Use a dictionary to store your DataFrames and access them by name:
files = ('data1.csv', 'data2.csv', 'data3.csv', 'data4.csv', 'data5.csv', 'data6.csv')
dfs_names = ('df1', 'df2', 'df3', 'df4', 'df5', 'df6')
dfs = {}
for dfn, file in zip(dfs_names, files):
    dfs[dfn] = pd.read_csv(file)
    print(dfs[dfn].shape)
    print(dfs[dfn].dtypes)
print(dfs['df3'])
Or use a list to store your DataFrames and access them by index:
files = ('data1.csv', 'data2.csv', 'data3.csv', 'data4.csv', 'data5.csv', 'data6.csv')
dfs = []
for file in files:
    dfs.append(pd.read_csv(file))
    print(dfs[-1].shape)
    print(dfs[-1].dtypes)
print(dfs[2])
Or do not store the intermediate DataFrames at all; just process each one and append it to the resulting DataFrame:
files = ('data1.csv', 'data2.csv', 'data3.csv', 'data4.csv', 'data5.csv', 'data6.csv')
df = pd.DataFrame()
for file in files:
    df_n = pd.read_csv(file)
    print(df_n.shape)
    print(df_n.dtypes)
    # do whatever you want to do with df_n here
    df = df.append(df_n)
print(df)
If you will process them differently, then you do not need a general structure to store them; just handle each one independently:
df = pd.DataFrame()

def do_general_stuff(d):  # here we do the things common to every DataFrame
    print(d.shape, d.dtypes)

df1 = pd.read_csv("data1.csv")
# do whatever you want to do with df1
do_general_stuff(df1)
df = df.append(df1)
del df1

df2 = pd.read_csv("data2.csv")
# do whatever you want to do with df2
do_general_stuff(df2)
df = df.append(df2)
del df2

df3 = pd.read_csv("data3.csv")
# do whatever you want to do with df3
do_general_stuff(df3)
df = df.append(df3)
del df3
# ... and so on
# ... and so on
And one geeky way, but don't ask how it works :)
from collections import namedtuple

files = ['data1.csv', 'data2.csv', 'data3.csv', 'data4.csv', 'data5.csv', 'data6.csv']
df = namedtuple('Cdfs',
                ['df1', 'df2', 'df3', 'df4', 'df5', 'df6']
                )(*[pd.read_csv(file) for file in files])

for df_n in df._fields:
    print(getattr(df, df_n).shape, getattr(df, df_n).dtypes)
print(df.df3)
I think you think your code is doing something that it is not actually doing.
Specifically, this line: df = pd.read_csv(file)
You might think that in each iteration of the for loop, this line is executed with df replaced by a string from dfs and file replaced by a filename from files, so that each CSV ends up in its own named variable. While file is indeed bound to each filename in turn, the assignment to df does not create variables named 'df1', 'df2', and so on.
Each iteration through the for loop reads a csv file and stores it in the variable df, overwriting the DataFrame that was read in during the previous iteration. In other words, df in your for loop is never replaced by the variable names you defined in dfs.
The key takeaway here is that strings (e.g., 'df1', 'df2', etc.) cannot be substituted and used as variable names when executing code.
One way to achieve the result you want is to store each DataFrame returned by pd.read_csv() in a dictionary, where the key is the name of the dataframe (e.g., 'df1', 'df2', etc.) and the value is the DataFrame itself.
list_of_dfs = {}
for df, file in zip(dfs, files):
    list_of_dfs[df] = pd.read_csv(file)
    print(list_of_dfs[df].shape)
    print(list_of_dfs[df].dtypes)
    print(list(list_of_dfs[df]))
You can then reference each of your dataframes like this:
print(list_of_dfs['df1'])
print(list_of_dfs['df2'])
You can learn more about dictionaries here:
https://docs.python.org/3.6/tutorial/datastructures.html#dictionaries
A dictionary can store them too:
import pandas as pd
from pprint import pprint

files = ('doms_stats201610051.csv', 'doms_stats201610052.csv')
dfsdic = {}
dfs = ('df1', 'df2')
for df, file in zip(dfs, files):
    dfsdic[df] = pd.read_csv(file)
    print(dfsdic[df].shape)
    print(dfsdic[df].dtypes)
    print(list(dfsdic[df]))
print(dfsdic['df1'].shape)
print(dfsdic['df2'].shape)
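For completeness, the dictionary approach collapses into a dict comprehension; a minimal sketch assuming the six data files from the question:
import pandas as pd

files = ('data1.csv', 'data2.csv', 'data3.csv', 'data4.csv', 'data5.csv', 'data6.csv')
# builds {'df1': <DataFrame>, 'df2': <DataFrame>, ...} in one pass
dfs = {f'df{i}': pd.read_csv(f) for i, f in enumerate(files, start=1)}
print(dfs['df3'].shape)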
