Arranging in ascending order in text file - python-3.x

So I have a text file which looks like this:
07,12,9201
07,12,9201
06,18,9209
06,18,9209
06,19,9209
06,19,9209
07,11,9201
I first want to remove all duplicate lines, then sort column 1 in ascending order, and then sort column 2 in ascending order within each value of column 1.
output:
06,18,9209
06,19,9209
07,11,9201
07,12,9201
I have tried this so far:
with open('abc.txt') as f:
    lines = [line.split(' ') for line in f]
Consider another example:
00,0,6098
00,1,6098
00,3,6098
00,4,6094
00,5,6094
00,6,6094
00,7,6094
00,8,6094
00,9,6498
00,2,6098
00,20,6102
00,21,6087
00,22,6087
00,23,6087
00,3,6098
00,4,6094
00,5,6094
00,6,6094
00,7,6094
00,8,6094
00,9,6498
The output for this file should be:
00,0,6098
00,1,6098
00,2,6098
00,3,6098
00,4,6094
00,5,6094
00,6,6094
00,7,6094
00,8,6094
00,9,6498
00,20,6102
00,21,6087
00,22,6087
00,23,6087

You can do something like the below:
from itertools import groupby, chain
from collections import OrderedDict

input_file = 'input_file.txt'
# Collect lines as tuples of fields
lines = [tuple(line.strip().split(',')) for line in open(input_file)]
# Remove duplicates and sort by the first column
sorted_lines = sorted(set(lines), key=lambda x: int(x[0]))
# Group by the first column and order each group by the second column
result = OrderedDict()
for k, g in groupby(sorted_lines, key=lambda x: x[0]):
    result[k] = sorted(g, key=lambda x: int(x[1]))
for v in chain(*result.values()):
    print(','.join(v))
Output 1:
06,18,9209
06,19,9209
07,11,9201
07,12,9201
Output 2:
00,0,6098
00,1,6098
00,2,6098
00,3,6098
00,4,6094
00,5,6094
00,6,6094
00,7,6094
00,8,6094
00,9,6498
00,20,6102
00,21,6087
00,22,6087
00,23,6087
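Note that because the grouping only serves to order rows by the second column within each first-column value, a single sort with a composite key gives the same result. A minimal sketch, reusing the question's abc.txt:
# Deduplicate with a set, then sort once on (column 1, column 2) as integers.
with open('abc.txt') as f:
    rows = {tuple(line.strip().split(',')) for line in f}

for row in sorted(rows, key=lambda x: (int(x[0]), int(x[1]))):
    print(','.join(row))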

Related

Replace items like A2 as AA in the dataframe

I have a list of items, like "A2BCO6" and "ABC2O6". I want to replace them as A2BCO6 --> AABCO6 and ABC2O6 --> ABCCO6. There are many more items than are presented here.
My dataframe is like:
listAB:
Finctional_Group
0 Ba2NbFeO6
1 Ba2ScIrO6
3 MnPb2WO6
I created a duplicate array and tried to replace the items in the following way:
B = ["Ba2", "Pb2"]
C = ["BaBa", "PbPb"]
for i,j in range(len(B)), range(len(C)):
    listAB["Finctional_Group"] = listAB["Finctional_Group"].str.strip().str.replace(B[i], C[j])
But it does not produce the correct output. The output is like:
listAB:
Finctional_Group
0 PbPbNbFeO6
1 PbPbScIrO6
3 MnPb2WO6
Please suggest the necessary correction in the code.
Many thanks in advance.
(As an aside, the loop in your attempt iterates over the tuple of the two range objects, so i is always 0 and j is always 1, which is why every "Ba2" was replaced with "PbPb".) For simplicity I used the chemparse package, which seems to suit your needs.
As always, we import the required packages, in this case chemparse and pandas.
import chemparse
import pandas as pd
Then we create a pandas.DataFrame object with your example data.
df = pd.DataFrame(
    columns=["Finctional_Group"], data=["Ba2NbFeO6", "Ba2ScIrO6", "MnPb2WO6"]
)
Our parser function will use chemparse.parse_formula, which returns a dict of elements and their frequencies in a molecular formula.
def parse_molecule(molecule: str) -> str:
    # Initialize an empty string
    molecule_in_string = ""
    # Iterate over all keys & values in the dict
    for key, value in chemparse.parse_formula(molecule).items():
        # Append the element symbol `value` times to the string
        molecule_in_string += key * int(value)
    return molecule_in_string
molecule_in_string now contains the molecular formula without numbers. We just need to map this function over all the elements in our dataframe column. For that we can do
df = df.applymap(parse_molecule)
print(df)
which returns:
0 BaBaNbFeOOOOOO
1 BaBaScIrOOOOOO
2 MnPbPbWOOOOOO
dtype: object
Source code for chemparse: https://gitlab.com/gmboyer/chemparse
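If you would rather avoid the extra dependency, the same expansion can be done with the standard re module. A sketch (expand_formula is a hypothetical helper, not part of chemparse), assuming the counts are plain integers:
import re

def expand_formula(formula: str) -> str:
    # Repeat each element symbol (a capital letter plus an optional
    # lowercase letter) by the count that follows it: "Ba2" -> "BaBa".
    return re.sub(r"([A-Z][a-z]?)(\d+)",
                  lambda m: m.group(1) * int(m.group(2)),
                  formula)

df = df.applymap(expand_formula)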

Can we get column names sorted in the order of their tf-idf values (if they exist) for each document?

I'm using sklearn's TfidfVectorizer. I'm trying to get, for each document, the column names in a list, in decreasing order of their tf-idf values. So basically, if a document contains only stop words, we don't need any column names.
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
msg = ["My name is Venkatesh",
       "Trying to get the significant words for each vector",
       "I want to get the list of words name in the decreasing order of their tf-idf values for each vector",
       "is to my"]
stopwords = ['is', 'to', 'my', 'the', 'for', 'in', 'of', 'i', 'their']
tfidf_vect = TfidfVectorizer(stop_words=stopwords)
tfidf_matrix = tfidf_vect.fit_transform(msg)
pd.DataFrame(tfidf_matrix.toarray(),
             columns=tfidf_vect.get_feature_names_out())
I want to generate a column with the list of word names in decreasing order of their tf-idf values.
So the column would be like this:
['venkatesh','name']
['significant','trying','vector','words','each','get']
['decreasing','idf','list','order','tf','values','want','each','get','name','vector','words']
[] # empty list, since the document consists only of stopwords
Above is the primary result I'm looking for; it would also be great if we could get a sorted dict with tf-idf values as keys and the list of words associated with each tf-idf value for each document.
So the result would be like the below:
{'0.785288':['venkatesh'],'0.619130':['name']}
{'0.47212':['significant','trying'],'0.372225':['vector','words','each','get']}
{'0.314534':['decreasing','idf','list','order','tf','values','want'],'0.247983':['each','get','name','vector','words']}
{} # empty dict, since the document consists only of stopwords
I think this code does what you want and avoids using pandas:
from itertools import groupby

sort_func = lambda v: v[0]  # sort by the first value in each tuple
all_dicts = []
for row in tfidf_matrix.toarray():
    sorted_vals = sorted(zip(row, tfidf_vect.get_feature_names_out()), key=sort_func, reverse=True)
    all_dicts.append({val: [g[1] for g in group] for val, group in groupby(sorted_vals, key=sort_func) if val != 0})
You could make it even less readable and put it all in a single comprehension! :-)
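For reference, since each dict is built from values sorted in decreasing order, its insertion order already matches the question's format. A short usage sketch with the all_dicts built above:
# One dict per document: {tf-idf value: [words]}
for d in all_dicts:
    print(d)

# The question's primary result: just the word lists, in decreasing order
for d in all_dicts:
    print([w for words in d.values() for w in words])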
The combination of the following function and the to_dict() method on the dataframe can give you the desired output.
def ret_dict(_dict):
    # Get a list of unique values
    list_keys = list(set(_dict.values()))
    processed_dict = {key: [] for key in list_keys}
    # Prepare the dictionary: group words under their tf-idf value
    for key, value in _dict.items():
        processed_dict[value].append(str(key))
    # Sort the keys in decreasing order (as you want), dropping zeros
    sorted_keys = sorted(processed_dict, key=lambda x: x, reverse=True)
    sorted_keys = [key for key in sorted_keys if key > 0]
    # Return the dictionary with sorted keys
    sorted_dict = {k: processed_dict[k] for k in sorted_keys}
    return sorted_dict
Then:
res = pd.DataFrame(tfidf_matrix.toarray(), columns=tfidf_vect.get_feature_names_out())
list_dict = res.to_dict('records')
processed_list = []
for _dict in list_dict:
    processed_list.append(ret_dict(_dict))
processed_list contains the output you desire. For instance, processed_list[1] would output:
{0.47212002654617047: ['significant', 'trying'], 0.3722248517590162: ['each', 'get', 'vector', 'words']}

Sorting a list of image filenames by number

I am trying to order the following list of image sequences, i.e. frame0.jpg --> frame17.jpg.
I have tried splitting the name and then sorting, but it doesn't work.
Here's my code:
Data = [
"D:\\shooting_videos\\example\\Output\\trained_framesmake2\\frame0.jpg",
"D:\\shooting_videos\\example\\Output\\trained_framesmake2\\frame1.jpg",
"D:\\shooting_videos\\example\\Output\\trained_framesmake2\\frame10.jpg",
"D:\\shooting_videos\\example\\Output\\trained_framesmake2\\frame11.jpg",
"D:\\shooting_videos\\example\\Output\\trained_framesmake2\\frame12.jpg",
"D:\\shooting_videos\\example\\Output\\trained_framesmake2\\frame13.jpg",
"D:\\shooting_videos\\example\\Output\\trained_framesmake2\\frame14.jpg",
"D:\\shooting_videos\\example\\Output\\trained_framesmake2\\frame15.jpg",
"D:\\shooting_videos\\example\\Output\\trained_framesmake2\\frame16.jpg",
"D:\\shooting_videos\\example\\Output\\trained_framesmake2\\frame17.jpg",
"D:\\shooting_videos\\example\\Output\\trained_framesmake2\\frame2.jpg",
"D:\\shooting_videos\\example\\Output\\trained_framesmake2\\frame3.jpg",
"D:\\shooting_videos\\example\\Output\\trained_framesmake2\\frame4.jpg",
"D:\\shooting_videos\\example\\Output\\trained_framesmake2\\frame5.jpg",
"D:\\shooting_videos\\example\\Output\\trained_framesmake2\\frame6.jpg",
"D:\\shooting_videos\\example\\Output\\trained_framesmake2\\frame7.jpg",
"D:\\shooting_videos\\example\\Output\\trained_framesmake2\\frame8.jpg",
"D:\\shooting_videos\\example\\Output\\trained_framesmake2\\frame9.jpg",
]
sorted_Data = sorted(Data, key=lambda x: int(x.split('.')[0]))
You can use the regex (\d+)\.[a-z]+$, which captures the digits immediately before the file extension (assumed to be alphabetical).
import re
sorted_Data = sorted(Data, key=lambda x: int(re.search(r"(\d+)\.[a-z]+$", x).group(1)))
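If some paths might lack a trailing frame number, re.search returns None and the key above raises an AttributeError. A slightly more defensive variant (the fallback of -1 is an assumption, sorting unnumbered paths first):
import re

def frame_key(path):
    # Extract the digits just before the extension; fall back to -1
    # when the pattern does not match (hypothetical choice).
    m = re.search(r"(\d+)\.[a-z]+$", path)
    return int(m.group(1)) if m else -1

sorted_Data = sorted(Data, key=frame_key)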

Question on calculating incoming data from file

I am reading a data file with several variables, and I need to calculate running totals of the different items by adding them up across lines. For example:
Fruit,Number
banana,25
apple,12
kiwi,29
apple,44
apple,81
kiwi,3
banana,109
kiwi,113
kiwi,68
We need to add a third variable which is a running total for that fruit, and a fourth which is a running total of all fruits.
So the output should be like following:
Fruit,Number,TotalFruit,TotalAllFruits
banana,25,25,25
apple,12,12,37
kiwi,29,29,66
apple,44,56,110
apple,81,137,191
kiwi,3,32,194
banana,109,134,303
kiwi,113,145,416
kiwi,68,213,484
I was able to get the first 2 columns printed, but I am having problems with the last 2 columns.
import sys
import re

f1 = open("SampleInput.csv", "r")
f2 = open('SampleOutput.csv', 'a')
sys.stdout = f2
print("Fruit,Number,TotalFruit,TotalAllFruits")
for line1 in f1:
    fruit_list = line1.split(',')
    exec("%s = %d" % (fruit_list[1], 0))
    print(fruit_list[0] + ',' + fruit_list[1])
I am just learning python, so I want to apologize in advance if I am missing something very simple.
You need to keep the values read from the input file, for example in a 2d array. During the loop, read the values accumulated from the previous lines and then calculate the values for the current line, printing the result after all input lines have been read. A minimal sketch of this idea follows.
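Here is one way to do it, keeping running totals in a dict instead of a full 2d array (a sketch, assuming the question's file names):
import csv

# Running total per fruit, plus a grand total across all fruits.
totals = {}
grand_total = 0

with open("SampleInput.csv", newline="") as f_in, \
     open("SampleOutput.csv", "w", newline="") as f_out:
    reader = csv.reader(f_in)
    writer = csv.writer(f_out)
    header = next(reader)  # Fruit,Number
    writer.writerow(header + ["TotalFruit", "TotalAllFruits"])
    for fruit, number in reader:
        n = int(number)
        totals[fruit] = totals.get(fruit, 0) + n
        grand_total += n
        writer.writerow([fruit, n, totals[fruit], grand_total])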
I would recommend you use the pandas library, as it makes this process easier:
import pandas as pd

df1 = pd.read_csv("SampleInput.csv", sep=",")
# Running total per fruit, and running total across all fruits
df1['TotalFruit'] = df1.groupby('Fruit')['Number'].cumsum()
df1['TotalAllFruits'] = df1['Number'].cumsum()
df1.to_csv('SampleOutput.csv', sep=",", index=False)
This writes the four columns shown in the expected output. Feel free to adapt the columns to your needs and add your custom logic.

dumping column data from data frames to list in python

import numpy as np
import pandas as pd
def ExtractCsv(Start, End):
    lsta, lstb, lstc, lstd = list(), list(), list(), list()
    j = 0
    for i in range(Start, End+1):
        f1 = pd.read_csv('C:/Users/sanilm/Desktop/Desktop_backup/opc/csv/Newfolder/fault_data.tar/fault_data/test/healthy' + str(i) + '.csv')
        listc = list(f1['c'])
        listd = list(f1['d'])
        liste = list(f1['e'])
        listf = list(f1['f'])
        lsta.append(listc)
        lstb.append(listd)
        lstc.append(liste)
        lstd.append(listf)
    print(lsta)
    return f1

f1 = ExtractCsv(1, 3)
Input CSV file (there are 3 such files):
a b c d e f
1 10 2901.1 13.915 39.812 6.2647
1 10 2906.1 13.368 42.083 12.945
1 10 2805.3 12.951 42.261 13.398
1 10 3049.2 14.101 43.499 15.237
1 10 2854.8 13.978 42.699 9.1297
expected output:
[2901.1, 2906.1, 2805.3, 3049.2, 2854.8, 2860.9, 2992.9, 2867.1, 2947.6, 2679.4, 2891.2, 2853.8, 2896.4, 3114.6, 3155.3, 2930.2, 2810.0, 2903.5]
but the output I am getting is
[[2901.1, 2906.1, 2805.3, 3049.2, 2854.8], [2860.9, 2992.9, 2867.1, 2947.6, 2679.4, 2891.2], [2853.8, 2896.4, 3114.6, 3155.3, 2930.2, 2810.0, 2903.5]]
Any suggestions on how I can achieve my expected output?
It looks like you just want to flatten your results.
Maybe try this:
Making a flat list out of list of lists in Python
From the link above (where you can find more information):
flat_list = [item for sublist in l for item in sublist]
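For instance, applied to a shortened version of the nested output shown in the question (a small self-contained illustration):
nested = [[2901.1, 2906.1, 2805.3], [2860.9, 2992.9], [2853.8, 2896.4]]
flat_list = [item for sublist in nested for item in sublist]
print(flat_list)  # [2901.1, 2906.1, 2805.3, 2860.9, 2992.9, 2853.8, 2896.4]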
In a loop you can create a list of DataFrames and then concat them together.
Also, if the files have many columns, you can add the usecols parameter to read_csv to read only the specified column names:
def ExtractCsv(Start, End):
    dfs = []
    for i in range(Start, End+1):
        path = 'C:/Users/sanilm/Desktop/Desktop_backup/opc/csv/Newfolder/fault_data.tar/fault_data/test/healthy' + str(i) + '.csv'
        f1 = pd.read_csv(path, usecols=['c','d','e','f'])
        dfs.append(f1)
    return pd.concat(dfs, ignore_index=True)

df = ExtractCsv(1, 3)
Finally, if you need to extract a column to a list:
lsta = df['c'].tolist()
