Writing to a TSV file from multiple lists in Python - python-3.x

I have two lists
ids = [1,2,3]
values = [10,20,30]
I need to create a tsv file with two columns - id and result and put the ids and values in there. The output should look like this
id result
1 10
2 20
3 30
I wrote the code below:
output_columns = ['id','result']
data = zip(output_columns, ids, values)
with open('output.tsv', 'w', newline='') as f_output:
    tsv_output = csv.writer(f_output, delimiter='\t')
    tsv_output.writerow(data)
But this gives me the following output, which is wrong:
('id', '1', '10') ('result', '2','20')
I understand that the output is wrong because of how I used zip to build the row of data, but I am not sure how to fix it.
Please suggest.

import csv
output_columns = ['id','result']
data = zip(ids, values)
with open('output.tsv', 'w', newline='') as f_output:
    tsv_output = csv.writer(f_output, delimiter='\t')
    # header row first, then one row per (id, value) pair
    tsv_output.writerow(output_columns)
    for id_, val in data:
        tsv_output.writerow([id_, val])
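As a variation on the loop above, `writerows` accepts any iterable of rows, so the zipped pairs can be handed over directly. This is only a minimal sketch using the standard library; the helper name `write_tsv` and its parameters are made up for illustration.

```python
import csv

def write_tsv(path, header, *columns):
    """Write a header row, then one row per zipped tuple of the column lists."""
    with open(path, "w", newline="") as f_output:
        tsv_output = csv.writer(f_output, delimiter="\t")
        tsv_output.writerow(header)
        # zip(*columns) pairs up the i-th element of every column list
        tsv_output.writerows(zip(*columns))

ids = [1, 2, 3]
values = [10, 20, 30]
write_tsv("output.tsv", ["id", "result"], ids, values)
```

Passing the columns as separate arguments keeps the helper usable for more than two lists, at the cost of silently truncating to the shortest list, as `zip` always does.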

It's easier using pandas
In [8]: df = pd.DataFrame({"ids":[1,2,3], "values":[10,20,30]})
In [9]: df
Out[9]:
ids values
0 1 10
1 2 20
2 3 30
In [10]: df.to_csv("data.tsv", sep="\t", index=False)

Related

A robust way to combine multiple dataframes considering different input scenarios

There are three data frames, df_1, df_2 and df_3. I combined them as follows
result1 = df_1.append(df_2,ignore_index=True)
result2 = result1.append(df_3,ignore_index=True)
Then result2 is the combined dataframe. This code segment currently works fine as long as none of the three input data frames is empty.
However, in practice, any of these three input data frames can be empty. What is the most efficient way to handle these scenarios without implementing complex if-else logic for each case (e.g., df_1 is empty, or both df_1 and df_3 are empty)?
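IIUC, use concat with a list of DataFrames; it works whether all, some, or none of the DataFrames are empty (DataFrame.append was also removed in pandas 2.0, so concat is the option that still works there):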
df_1 = pd.DataFrame()
df_2 = pd.DataFrame()
df_3 = pd.DataFrame()
df = pd.concat([df_1, df_2, df_3],ignore_index=True)
print (df)
Empty DataFrame
Columns: []
Index: []
df_1 = pd.DataFrame()
df_2 = pd.DataFrame({'a':[1,2]})
df_3 = pd.DataFrame({'a':[10,20]})
df = pd.concat([df_1, df_2, df_3],ignore_index=True)
print (df)
a
0 1
1 2
2 10
3 20
df_1 = pd.DataFrame()
df_2 = pd.DataFrame({'a':[1,2]})
df_3 = pd.DataFrame()
df = pd.concat([df_1, df_2, df_3],ignore_index=True)
print (df)
a
0 1
1 2
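If you would rather skip empty frames explicitly (for example, so that they cannot influence the result dtypes), a small helper can filter them out before concatenating. This is a sketch under that assumption; `combine_frames` is a hypothetical name, not a pandas function.

```python
import pandas as pd

def combine_frames(frames):
    """Concatenate the non-empty frames; return an empty DataFrame if none remain."""
    non_empty = [df for df in frames if not df.empty]
    if not non_empty:
        # pd.concat raises ValueError on an empty list, so handle that case here
        return pd.DataFrame()
    return pd.concat(non_empty, ignore_index=True)

df_1 = pd.DataFrame()
df_2 = pd.DataFrame({"a": [1, 2]})
df_3 = pd.DataFrame({"a": [10, 20]})
print(combine_frames([df_1, df_2, df_3]))
```

The guard for the all-empty case is the only branching needed, whichever combination of inputs happens to be empty.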

Python - Pandas: perform column value based data grouping across separate dataframe chunks

I was handling a large csv file, and came across this problem. I am reading in the csv file in chunks and want to extract sub-dataframes based on values for a particular column.
To explain the problem, here is a minimal version:
The CSV (save it as test1.csv, for example)
1,10
1,11
1,12
2,13
2,14
2,15
2,16
3,17
3,18
3,19
3,20
4,21
4,22
4,23
4,24
Now, as you can see, if I read the csv in chunks of 5 rows, the first column's values will be distributed across the chunks. What I want to be able to do is load in memory only the rows for a particular value.
I achieved it using the following:
import pandas as pd
list_of_ids = dict() # this will contain all "id"s and the start and end row index for each id
# read the csv in chunks of 5 rows
for df_chunk in pd.read_csv('test1.csv', chunksize=5, names=['id','val'], iterator=True):
    #print(df_chunk)
    # In each chunk, get the unique id values and add to the list
    for i in df_chunk['id'].unique().tolist():
        if i not in list_of_ids:
            list_of_ids[i] = [] # initially new values do not have the start and end row index
    for i in list_of_ids.keys(): # ---------MARKER 1-----------
        idx = df_chunk[df_chunk['id'] == i].index # get row index for particular value of id
        if len(idx) != 0: # if id is in this chunk
            if len(list_of_ids[i]) == 0: # if the id is new in the final dictionary
                list_of_ids[i].append(idx.tolist()[0]) # start
                list_of_ids[i].append(idx.tolist()[-1]) # end
            else: # if the id was there in previous chunk
                list_of_ids[i] = [list_of_ids[i][0], idx.tolist()[-1]] # keep old start, add new end
            #print(df_chunk.iloc[idx, :])
            #print(df_chunk.iloc[list_of_ids[i][0]:list_of_ids[i][-1], :])
print(list_of_ids)
skip = None
rows = None
# Now from the file, I will read only particular id group using following
# I can again use chunksize argument to read the particular group in pieces
for id, se in list_of_ids.items():
    print('Data for id: {}'.format(id))
    skip, rows = se[0], (se[-1] - se[0]+1)
    for df_chunk in pd.read_csv('test1.csv', chunksize=2, nrows=rows, skiprows=skip, names=['id','val'], iterator=True):
        print(df_chunk)
Truncated output from my code:
{1: [0, 2], 2: [3, 6], 3: [7, 10], 4: [11, 14]}
Data for id: 1
id val
0 1 10
1 1 11
id val
2 1 12
Data for id: 2
id val
0 2 13
1 2 14
id val
2 2 15
3 2 16
Data for id: 3
id val
0 3 17
1 3 18
What I want to ask is: is there a better way of doing this? The loop at MARKER 1 in the code is bound to become inefficient as the number of ids grows. I did reduce memory usage, but running time remains a problem. Is there an existing method for this?
(I am looking for complete code in the answer.)
I suggest you use itertools for this, as follows:
import pandas as pd
import csv
import io
from itertools import groupby, islice
from operator import itemgetter
def chunker(n, iterable):
    """
    From answer: https://stackoverflow.com/a/31185097/4001592
    >>> list(chunker(3, 'ABCDEFG'))
    [['A', 'B', 'C'], ['D', 'E', 'F'], ['G']]
    """
    iterable = iter(iterable)
    return iter(lambda: list(islice(iterable, n)), [])
chunk_size = 5
with open('test1.csv') as csv_file:
    reader = csv.reader(csv_file)
    for _, group in groupby(reader, itemgetter(0)):
        for chunk in chunker(chunk_size, group):
            g = [','.join(e) for e in chunk]
            df = pd.read_csv(io.StringIO('\n'.join(g)), header=None)
            print(df)
            print('---')
Output (partial)
0 1
0 1 10
1 1 11
2 1 12
---
0 1
0 2 13
1 2 14
2 2 15
3 2 16
---
0 1
0 3 17
1 3 18
2 3 19
3 3 20
---
...
This approach first reads the rows in groups keyed by the first column:
for _, group in groupby(reader, itemgetter(0)):
and then reads each group in chunks of at most 5 rows (this can be changed via chunk_size):
for chunk in chunker(chunk_size, group):
The last part:
g = [','.join(e) for e in chunk]
df = pd.read_csv(io.StringIO('\n'.join(g)), header=None)
print(df)
print('---')
joins each chunk's rows back into CSV text that can be passed to pandas.
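One caveat worth keeping in mind: `itertools.groupby` only groups consecutive items with the same key, so this relies on the file already being ordered by the first column, as test1.csv is. The grouping and chunking steps can be tried on their own, without pandas; the rows below are made up for illustration.

```python
from itertools import groupby, islice
from operator import itemgetter

def chunker(n, iterable):
    """Return an iterator over successive lists of at most n items."""
    iterable = iter(iterable)
    return iter(lambda: list(islice(iterable, n)), [])

# Illustrative rows, assumed to be already ordered by the first field.
rows = [["1", "10"], ["1", "11"], ["1", "12"], ["2", "13"], ["2", "14"]]

# Each group is fully consumed by chunker before groupby advances to the next key.
chunks = [
    (key, chunk)
    for key, group in groupby(rows, itemgetter(0))
    for chunk in chunker(2, group)
]
for key, chunk in chunks:
    print(key, chunk)
```

If the input were not ordered by id, the same id would show up as several separate groups, so sorting (or a different strategy) would be needed first.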

Pandas: check whether searched prefixes exist and drop columns that have no data

I have below code snippet which works fine.
import pandas as pd
import numpy as np
prefixes = ['sj00', 'sj12', 'cr00', 'cr08', 'eu00', 'eu50']
df = pd.read_csv('new_hosts', index_col=False, header=None)
df['prefix'] = df[0].str[:4]
df['grp'] = df.groupby('prefix').cumcount()
df = df.pivot(index='grp', columns='prefix', values=0)
df['sj12'] = df['sj12'].str.extract('(\w{2}\d{2}\w\*)', expand=True)
df = df[ prefixes ].dropna(axis=0, how='all').replace(np.nan, '', regex=True)
df = df.rename_axis(None)
Example File new_hosts
sj000001
sj000002
sj000003
sj000004
sj124000
sj125000
sj126000
sj127000
sj128000
sj129000
sj130000
sj131000
sj132000
cr000011
cr000012
cr000013
cr000014
crn00001
crn00002
crn00003
crn00004
euk000011
eu0000012
eu0000013
eu0000014
eu5000011
eu5000013
eu5000014
eu5000015
Current output:
sj00 sj12 cr00 cr08 eu00 eu50
sj000001 cr000011 crn00001 euk000011 eu5000011
sj000002 cr000012 crn00002 eu0000012 eu5000013
sj000003 cr000013 crn00003 eu0000013 eu5000014
sj000004 cr000014 crn00004 eu0000014 eu5000015
What's expected:
1) The code works, but as the current output shows, the second column has no values and still appears. How can I check whether a column is empty and, if so, leave it out of the display?
2) Can we check that the prefixes exist in the dataframe before processing, to avoid the error?
Appreciate any help.
IIUC, before
df = df[ prefixes ].dropna(axis=0, how='all').replace(np.nan, '', regex=True)
you can do:
# remove all empty columns
df = df.dropna(axis=1, how='all')
That would solve the first part. For the second part, you can use reindex:
# select prefixes:
prefixes = ['sj00', 'sj12', 'cr00', 'cr08', 'eu00', 'eu50', 'sh00', 'dt00', 'sh00', 'dt00']
df = df.reindex(prefixes, axis=1).dropna(axis=1, how='all').replace(np.nan, '', regex=True)
Note that dropna uses axis=1 here, not axis=0; it is the same call I proposed for question 1.
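To see why reindex avoids the error, here is a small self-contained sketch. The toy frame and its values are invented stand-ins for the pivoted host table; only the `reindex` / `dropna` chain comes from the answer above, and `fillna("")` is used in place of `replace` for brevity.

```python
import pandas as pd

# Toy frame standing in for the pivoted host table (values are made up).
df = pd.DataFrame({
    "sj00": ["sj000001", "sj000002"],
    "sj12": [None, None],             # column exists but holds no data
    "eu50": ["eu5000011", None],
})
prefixes = ["sj00", "sj12", "cr08", "eu50"]  # "cr08" is absent from df

# reindex adds missing labels as all-NaN columns instead of raising KeyError;
# dropna(axis=1, how="all") then removes every column that is entirely empty.
result = df.reindex(prefixes, axis=1).dropna(axis=1, how="all").fillna("")
print(result)
```

Both the empty `sj12` column and the missing `cr08` prefix disappear from the result, so one chain covers both questions.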
Many thanks to Quang Hoang for the hints on the post. As a workaround, I got it working as follows until I get a better answer:
# Select prefixes
prefixes = ['sj00', 'sj12', 'cr00', 'cr08', 'eu00', 'eu50']
df = pd.read_csv('new_hosts', index_col=False, header=None)
df['prefix'] = df[0].str[:4]
df['grp'] = df.groupby('prefix').cumcount()
df = df.pivot(index='grp', columns='prefix', values=0)
df = df[prefixes]
# For column `sj12`, extract only the values that start with `sj12` followed immediately by a word character, like `sj12[a-z]`
df['sj12'] = df['sj12'].str.extract('(\w{2}\d{2}\w\*)', expand=True)
df.replace('', np.nan, inplace=True)
# Remove the empty columns
df = df.dropna(axis=1, how='all')
# Then drop rows where every value is NaN and replace the remaining NaN with empty strings
df = df.dropna(axis=0, how='all').replace(np.nan, '', regex=True)
# drop the index field
df = df.rename_axis(None)
print(df)

Subtract a single value from columns in pandas

I have two data frames, df and df_test. For each row of df_test, I am trying to create a new dataframe that holds the differences between the x coordinates and between the y coordinates. I would also like to add a new column that gives the magnitude of this distance between objects. Below is my code.
import pandas as pd
import numpy as np
# Create Dataframe
index_numbers = np.linspace(0, 10, 11, dtype=int)
index_ = ['OP_%s' % number for number in index_numbers]
header = ['X', 'Y', 'D']
# print(index_)
data = np.round(np.random.uniform(low=0, high=10, size=(len(index_), 3)), decimals=0)
# print(data)
df = pd.DataFrame(data=data, index=index_, columns=header)
df_test = df.sample(3)
# print(df)
# print(df_test)
for index, row in df_test.iterrows():
    print(index)
    print(row)
    df_(index) = df
    df_(index)['X'] = df['X'] - df_test['X'][row]
    df_(index)['Y'] = df['Y'] - df_test['Y'][row]
    df_(index)['Dist'] = np.sqrt(df_(index)['X']**2 + df_(index)['Y']**2)
    print(df_(index))
Better For Loop
for index, row in df_test.iterrows():
    # print(index)
    # print(row)
    # print("df_{0}".format(index))
    df_temp = df.copy()
    df_temp['X'] = df_temp['X'] - df_test['X'][index]
    df_temp['Y'] = df_temp['Y'] - df_test['Y'][index]
    df_temp['Dist'] = np.sqrt(df_temp['X']**2 + df_temp['Y']**2)
    print(df_temp)
I have written a for loop that runs through each row of the df_test dataframe and tries to create the columns. The (index) in each iteration is meant to name the new data frame after the test row used. Once a dataframe has its modified and new columns, I need to save it to a dictionary. The new loop produces each of the dataframes I need, but what is the best way to save them? Any help in creating these columns would be greatly appreciated.
Please comment with any questions so that I can make it easier to understand, if need be.
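One way to keep each per-row result is a dictionary keyed by the df_test row label. The following is a minimal sketch with small made-up frames rather than a definitive answer; the name `offsets` is hypothetical, and `np.hypot` is swapped in for the explicit square root.

```python
import numpy as np
import pandas as pd

# Small fixed frames so the sketch is reproducible (values are made up).
df = pd.DataFrame({"X": [0.0, 3.0, 6.0], "Y": [0.0, 4.0, 8.0]},
                  index=["OP_0", "OP_1", "OP_2"])
df_test = df.loc[["OP_1"]]

offsets = {}  # maps each df_test row label to its own offset frame
for label, row in df_test.iterrows():
    # Subtracting a Series from a DataFrame aligns on column names,
    # so this row's X and Y are removed from every row of df.
    df_temp = df[["X", "Y"]] - row[["X", "Y"]]
    df_temp["Dist"] = np.hypot(df_temp["X"], df_temp["Y"])
    offsets[label] = df_temp

print(offsets["OP_1"])
```

Each frame can then be looked up by its label, for example `offsets["OP_1"]`, instead of relying on dynamically named variables.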

Calling a data frame via a string

I have a list of countries such as:
country = ["Brazil", "Chile", "Colombia", "Mexico", "Panama", "Peru", "Venezuela"]
I created data frames using the names from the country list:
for c in country:
    c = pd.read_excel(str(c + ".xls"), skiprows = 1)
    c = pd.to_datetime(c.Date, infer_datetime_format=True)
    c = c[["Date", "spreads"]]
Now I want to be able to merge all the countries data frames using the columns date as the key. The idea is to create a loop like the following:
df = Brazil #this is the first dataframe, which also corresponds to the first element of the list country.
for i in range(len(country)-1):
    df = df.merge(country[i+1], on = "Date", how = "inner")
df.set_index("Date", inplace=True)
I got the error ValueError: can not merge DataFrame with instance of type <class 'str'>. It seems Python is not picking up the data frame whose name is in the country list. How can I refer to those data frames starting from the country list?
Thanks masters!
Your loop doesn't modify the contents of the country list, so country is still a list of strings.
Consider building a new list of dataframes and looping over that:
country_dfs = []
for c in country:
    df = pd.read_excel(c + ".xls", skiprows=1)
    # convert the Date column in place; assigning the result to df would replace the frame with a Series
    df["Date"] = pd.to_datetime(df["Date"])
    df = df[["Date", "spreads"]]
    # add new dataframe to our list of dataframes
    country_dfs.append(df)
Then, to merge:
merged_df = country_dfs[0]
for df in country_dfs[1:]:
    merged_df = merged_df.merge(df, on='Date', how='inner')
merged_df.set_index('Date', inplace=True)
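The same idea can be sketched with a dictionary keyed by country name, which also gives a way to look a frame up from a string. The frames below are made-up stand-ins for the Excel files, and renaming "spreads" per country is an assumption added here so that repeated merges do not yield clashing suffixed column names.

```python
from functools import reduce

import pandas as pd

# Stand-ins for the per-country frames read from Excel (values are made up).
dates = pd.to_datetime(["2020-01-01", "2020-01-02"])
frames = {
    "Brazil": pd.DataFrame({"Date": dates, "spreads": [1.0, 1.1]}),
    "Chile": pd.DataFrame({"Date": dates, "spreads": [2.0, 2.1]}),
    "Peru": pd.DataFrame({"Date": dates[:1], "spreads": [3.0]}),
}

# Rename "spreads" per country so the merged columns stay distinguishable.
renamed = [df.rename(columns={"spreads": name}) for name, df in frames.items()]

# Fold the list into one frame with successive inner merges on "Date".
merged_df = reduce(
    lambda left, right: left.merge(right, on="Date", how="inner"), renamed
)
merged_df = merged_df.set_index("Date")
print(merged_df)
```

With the inner join, only dates present in every country's frame survive, so here a single row remains.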
