I have a list of CSV file names and have to trim the last 10 characters of each in Python. Below is my code.
s=['SSP_AP_INVOICE_DISTRIBUTIONS_17022023072701.csv','SSP_AP_INVOICE_16022023072701.csv','SSP_HR_ALL_ORGANIZATION_UNITS_18012023043243.CSV']
for i in s:
    date_str = i.split('_')[4]
    x = date_str[:8]
    print(x)
But some names in my list have 3, 4, or 5 underscores, so splitting on '_' and taking the part at index 4 only works for some of them. I want to remove the last 10 characters of each name, including '.csv'.
The output I am expecting is:
SSP_AP_INVOICE_DISTRIBUTIONS_17022023
SSP_AP_INVOICE_16022023
SSP_HR_ALL_ORGANIZATION_UNITS_18012023
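One way to sidestep the underscore counting entirely: if every name ends with a fixed-width 10-character tail (the 6 trailing timestamp digits plus the extension), plain slicing does the job. A minimal sketch under that assumption:

```python
# Assumes every filename ends with a fixed 10-character tail:
# 6 trailing timestamp digits plus ".csv"/".CSV".
s = ['SSP_AP_INVOICE_DISTRIBUTIONS_17022023072701.csv',
     'SSP_AP_INVOICE_16022023072701.csv',
     'SSP_HR_ALL_ORGANIZATION_UNITS_18012023043243.CSV']

trimmed = [name[:-10] for name in s]
for t in trimmed:
    print(t)
```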
Restore the order of mismatched lines of CSV file in Python
Given that the number of columns is 3, the head of the data is correct, and the column delimiter is "<|>", the mismatched lines are caused by an accidental newline fed into the data.
Consider the following CSV file,
PERSON_ID<|>DEPT_ID<|>DATE_JOINED
AAAAA<|>S1<|>2021/01
/03
BBBBBB<|>S2<|>2021/02/03
CCCCC<|>S1<|>2021/03/05
I wish the output to have the broken row restored; the corrected rows are shown at the end of this question.
The first thing I did was to remove the whitespace in the CSV file.
import re
your_string ="""PERSON_ID<|>DEPT_ID<|>DATE_JOINED
AAAAA<|>S1<|>2021/01
/03
BBBBBB<|>S2<|>2021/02/03
CCCCC<|>S1<|>2021/03/05"""
print(re.sub(r'\s{1,}','',your_string.strip()))
After this step I get one continuous, tape-like string:
PERSON_ID<|>DEPT_ID<|>DATE_JOINEDAAAAA<|>S1<|>2021/01/03BBBBBB<|>S2<|>2021/02/03CCCCC<|>S1<|>2021/03/05
Now I need to feed a correct newline back in at "2021/01/03BBBBBB".
Assuming the total number of columns is 3, we need to feed a newline in between each:
the 2nd delimiter and the 3rd delimiter,
the 4th delimiter and the 5th delimiter,
the 6th delimiter and the 7th delimiter... and so on.
Assuming the date shown in the string has a fixed length of 10, I need to feed a newline in at each designated position once the field after the delimiter reaches a string length of 10.
Assuming the data head will not change, I can insert a newline after a string length of 33 from the beginning of the file.
Then, finally, I can get my data in correct lines; the output rows of the CSV would be:
PERSON_ID<|>DEPT_ID<|>DATE_JOINED
AAAAA<|>S1<|>2021/01/03
BBBBBB<|>S2<|>2021/02/03
CCCCC<|>S1<|>2021/03/05
After this, I can separate the fields by the string delimiters and thus complete the restoration of the mismatched lines.
Therefore, I need help on how to insert a newline between the designated delimiters once the field reaches a string length of 10 from its beginning.
Thanks!
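Not a definitive solution, but the fixed-width idea described above can be sketched with a regex, assuming every date matches YYYY/MM/DD (length 10) and the header is the fixed 33-character string shown:

```python
import re

raw = """PERSON_ID<|>DEPT_ID<|>DATE_JOINED
AAAAA<|>S1<|>2021/01
/03
BBBBBB<|>S2<|>2021/02/03
CCCCC<|>S1<|>2021/03/05"""

# Step 1: collapse all whitespace into one continuous, tape-like string.
flat = re.sub(r'\s+', '', raw.strip())

# Step 2: re-insert a newline after the fixed 33-character head...
flat = flat.replace('DATE_JOINED', 'DATE_JOINED\n', 1)
# ...and after every fixed-length (10-character) date.
fixed = re.sub(r'(\d{4}/\d{2}/\d{2})', r'\1\n', flat)

print(fixed.strip())
```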
What about getting the lines of fields directly? Like this:
sep = '<|>'
your_data = [line.strip().split(sep) for line in your_string.strip().split('\n') if sep in line]
You get:
[['PERSON_ID', 'DEPT_ID', 'DATE_JOINED'], ['AAAAA', 'S1', '2021/01'], ['BBBBBB', 'S2', '2021/02/03'], ['CCCCC', 'S1', '2021/03/05']]
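Note that the `if sep in line` filter silently drops the stray "/03" continuation line, which is why the first data row above still ends in '2021/01'. A variant (my own addition, not part of the original answer) that glues separator-less lines back onto the previous row:

```python
sep = '<|>'
your_string = """PERSON_ID<|>DEPT_ID<|>DATE_JOINED
AAAAA<|>S1<|>2021/01
/03
BBBBBB<|>S2<|>2021/02/03
CCCCC<|>S1<|>2021/03/05"""

your_data = []
for line in your_string.strip().split('\n'):
    if sep in line:
        your_data.append(line.strip().split(sep))
    elif your_data:
        # No separator: treat the line as a continuation of the last field.
        your_data[-1][-1] += line.strip()

print(your_data)
```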
I want to extract only the numbers from lines in a txt file that contain a certain keyword, add them up, compare the totals, and then print the highest and the lowest total. How should I go about this?
I want to print the highest and the lowest valid totals.
I have managed to extract the lines with the "VALID" keyword in them; now I want to get the numbers from those lines, add up the numbers on each line, compare the totals of the lines sharing that keyword, and print the highest and the lowest valid totals.
My code so far:
# get a file object for the file
file = open("shelfs.txt", "r")
# read the content of the file into a string
data = file.read()
# close the file
file.close()
#get number of occurrences of the substring in the string
totalshelfs = data.count("SHELF")
totalvalid = data.count("VALID")
totalinvalid = data.count("INVALID")
print('Number of total shelfs :', totalshelfs)
print('Number of valid books :', totalvalid)
print('Number of invalid books :', totalinvalid)
txt file
HEADER|<br>
SHELF|2200019605568|<br>
BOOK|20200120000000|4810.1|20210402|VALID|<br>
SHELF|1591024987400|<br>
BOOK|20200215000000|29310.0|20210401|VALID|<br>
SHELF|1300001188124|<br>
BOOK|20200229000000|11519.0|20210401|VALID|<br>
SHELF|1300001188124|<br>
BOOK|20200329001234|115.0|20210331|INVALID|<br>
SHELF|1300001188124|<br>
BOOK|2020032904567|1144.0|20210401|INVALID|<br>
FOOTER|
What you need is to use the pandas library.
https://pandas.pydata.org/
You can read a csv file like this:
data = pd.read_csv('shelfs.txt', sep='|')
It returns a DataFrame object that makes it easy to select or sort your data. It will use the first row as the header; then you can select a specific column like a dictionary:
header = data['HEADER']
header is a Series object.
To select rows you can do:
shelfs = data.loc[data['HEADER'] == 'SHELF']
to keep only the rows where the first column is 'SHELF'.
I'm just not sure how pandas will handle the fact that you only have 1 header but 2 or 5 columns.
Maybe you should try to create one header per column in your csv, and add separators to make each row the same size first.
Edit (No External libraries or change in the txt file):
# Split into rows
data = data.split('<br>\n')
# Split each row into columns
data = [d.split('|') for d in data]
# Pad short rows so every row has the same number of columns
n_cols = max(len(d) for d in data)
for i in range(len(data)):
    while len(data[i]) < n_cols:
        data[i].append('')
# Keep only the VALID rows
valid_rows = [d for d in data if d[4] == 'VALID']
# Convert the numeric fields, sum them per row, then take min and max
valid_sum = [float(d[1]) + float(d[2]) + float(d[3]) for d in valid_rows]
valid_minimum = min(valid_sum)
valid_maximum = max(valid_sum)
It's maybe not exactly what you want to do, but it solves part of your problem. I didn't test the code.
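Since the code above is untested, here is the same approach as a self-contained sketch, run directly on the sample text from the question (keeping the '<br>' tags as posted):

```python
text = """HEADER|<br>
SHELF|2200019605568|<br>
BOOK|20200120000000|4810.1|20210402|VALID|<br>
SHELF|1591024987400|<br>
BOOK|20200215000000|29310.0|20210401|VALID|<br>
SHELF|1300001188124|<br>
BOOK|20200229000000|11519.0|20210401|VALID|<br>
SHELF|1300001188124|<br>
BOOK|20200329001234|115.0|20210331|INVALID|<br>
SHELF|1300001188124|<br>
BOOK|2020032904567|1144.0|20210401|INVALID|<br>
FOOTER|"""

# Strip the '<br>' residue, then split into rows and columns.
rows = [line.replace('<br>', '').split('|') for line in text.split('\n')]
# Keep only the BOOK rows flagged VALID.
valid = [r for r in rows if len(r) > 4 and r[4] == 'VALID']
# Sum the numeric fields of each valid row.
totals = [float(r[1]) + float(r[2]) + float(r[3]) for r in valid]
print('highest valid total:', max(totals))
print('lowest valid total:', min(totals))
```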
import csv

email = 'someone#somemail.com'
password = 'password123'

with open('test.csv', 'a', newline='') as accts:
    b = csv.writer(accts, delimiter=',')
    b.writerow(email)
    b.writerow(password)
I'm trying to append email and password to a csv file on the same row, but every time I run the program each letter gets its own column and the password is written on the row under the email. What am I doing wrong?
Output:
s,o,m,e,o,n,e,#,s,o,m,e,m,a,i,l,.,c,o,m
p,a,s,s,w,o,r,d,1,2,3
Desired output:
someone#somemail.com,password123
To writerow, a string looks like a list of individual characters, and writerow expects a list of the column values, so you end up with one column per character.
Instead, use a list of the column values:
b.writerow([email,password])
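For completeness, the corrected snippet in full (same file name and values as in the question):

```python
import csv

email = 'someone#somemail.com'
password = 'password123'

with open('test.csv', 'a', newline='') as accts:
    b = csv.writer(accts, delimiter=',')
    b.writerow([email, password])  # one list of column values -> one row
```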
I have more than 1000 csv files, and I want to combine those whose filenames share the same first five digits into one csv file.
input:
100044566.csv
100040457.csv
100041458.csv
100034566.csv
100030457.csv
100031458.csv
100031459.csv
import pandas as pd
import os
import glob

path_1 = ''
all_files_final = glob.glob(os.path.join(path_1, "*.csv"))
names_1 = [os.path.basename(x1) for x1 in all_files_final]
final = pd.DataFrame()
for file_1, name_1 in zip(all_files_final, names_1):
    file_df_final = pd.read_csv(file_1, index_col=False)
    #file_df['file_name'] = name
    final = final.append(file_df_final)
final.to_csv('', index=False)
I used the above code, but it merges all the files into one csv file; I don't know how to make the selection based on the name. So, from the above input:
output 1: combine the first three csv files into one csv file, because the first five digits of their filenames are the same.
output 2: combine the next four files into another csv file, because the first five digits of their filenames are the same.
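The missing selection step is just a grouping of the filenames by their first five characters; a minimal sketch on the listed names (before any pandas work):

```python
from collections import defaultdict

names = ['100044566.csv', '100040457.csv', '100041458.csv',
         '100034566.csv', '100030457.csv', '100031458.csv',
         '100031459.csv']

groups = defaultdict(list)
for name in names:
    groups[name[:5]].append(name)  # key = first five digits of the filename

print(dict(groups))
```

Each value of groups is then the list of files to read and concatenate into one output csv.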
I would recommend you to approach the problem slightly differently.
Here's my solution:
import os
import pandas as pd

files = os.listdir('.')  # list of filenames in the current folder
files_of_interest = {}   # dictionary we will use below

for filename in files:              # iterate over the files in the folder
    if filename[-4:] == '.csv':     # check whether the file is in .csv format
        key = filename[:5]          # as mentioned in the question, the first five characters are of interest
        files_of_interest.setdefault(key, [])  # create the key with an empty list if it does not exist yet
        files_of_interest[key].append(filename)  # append the new filename to that list

for key in files_of_interest:
    buff_df = pd.DataFrame()
    for filename in files_of_interest[key]:
        buff_df = pd.concat([buff_df, pd.read_csv(filename)])  # read and append every file for this key
    files_of_interest[key] = buff_df  # replace the list of files with a DataFrame
This code creates a dictionary of DataFrames, keyed by the first five characters of the .csv filenames.
Then you can iterate over the keys of the dictionary to save each DataFrame as a .csv file.
Hope my answer helps.
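The saving step mentioned above could look like this (the output filenames are my own assumption):

```python
import pandas as pd

# Stand-in for the dictionary of DataFrames built above.
files_of_interest = {'10004': pd.DataFrame({'a': [1, 2]}),
                     '10003': pd.DataFrame({'a': [3]})}

for key, df in files_of_interest.items():
    df.to_csv(key + '_combined.csv', index=False)  # e.g. '10004_combined.csv'
```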
I have an Excel file (.xlsx) with a column containing rows of strings. I used the following code to load the file:
import pandas as pd
df = pd.read_excel("file.xlsx")
db = df['Column Title']
I am removing the punctuation from the first row of the column using this code:
import string
translator = str.maketrans('', '', string.punctuation)
sent_pun = db[0].translate(translator)
I would like to remove the punctuation from every row, down to the last one. How would I correctly write this as a loop? Thank you.
Well, given that this code works for one value and produces the right kind of result, you can apply the same translation to every row of the column in a loop:
translator = str.maketrans('', '', string.punctuation)
sent_pun = [db[i].translate(translator) for i in range(len(db))]
sent_pun is then a list with one punctuation-free string per row. Change the range as per your need if you only want some of the rows.
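As a quick self-contained check of that loop (hypothetical sample strings stand in for the spreadsheet column):

```python
import string

translator = str.maketrans('', '', string.punctuation)

# Hypothetical stand-in for df['Column Title']
db = ["Hello, world!", "It's a test.", "No punctuation here"]

sent_pun = [db[i].translate(translator) for i in range(len(db))]
print(sent_pun)
```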