How to drop some parts of a text in a column - python-3.x

I know this question must have been addressed before, but I can't seem to find the answer.
I have a column in my dataframe and I want to drop part of each string, starting from a specified character. For example, given the string 'WD-2020-04-115R:WD-2020-03-111', I want everything from the R onward removed so that I am left with WD-2020-04-115. Any string in the column without an R in it should be kept unchanged.

Try:
data_array = ['WD-2020-04-115R:WD-2020-03-111', 'WD-2020-05-10582', 'WD-2020-05-10575', 'WD-2020-05-10576','WD-2020-05-10574', 'WD-2020-05-10571R:WD-2020-03-10563', 'WD-2020-05-10577', 'WD-2020-04-10571R:WD-2020-03-10562']
for data in data_array:
    t = data.find('R')
    if t < 0:
        dropped = data
    else:
        dropped = data[:t]
    print(dropped)
    # You can either print, append to an array or write to a file
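As a more compact variant of the loop above, `str.partition` splits a string at the first occurrence of a separator and returns the whole string as the first element when the separator is absent, so no explicit check is needed. This is only a sketch; the helper name `drop_from_marker` is made up for illustration.

```python
def drop_from_marker(text, marker="R"):
    """Return the part of text before the first marker, or all of text if the marker is absent."""
    return text.partition(marker)[0]


samples = ['WD-2020-04-115R:WD-2020-03-111', 'WD-2020-05-10582']
cleaned = [drop_from_marker(s) for s in samples]
print(cleaned)
```

If the strings live in a pandas column, the same function could be applied with `Series.map`, for example `df['col'].map(drop_from_marker)`, where `df` and `'col'` are placeholder names.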

Related

Extract numbers only from specific lines within a txt file with certain keywords in Python

I want to extract the numbers only from lines in a txt file that contain a certain keyword, add them up per line, compare the totals, and then print the highest and the lowest total. How should I go about this?
I want to print the highest and the lowest valid total numbers.
I managed to extract the lines with the "valid" keyword in them, but now I want to get the numbers from these lines, add up the numbers of each line, compare the totals with those of the other lines that have the same keyword, and print the highest and the lowest valid totals.
My code so far:
#get file object reference to the file
file = open("shelfs.txt", "r")
#read content of file to string
data = file.read()
#close file
file.close()
#get number of occurrences of the substring in the string
totalshelfs = data.count("SHELF")
#count "|VALID|" with its delimiters, because "INVALID" also contains "VALID"
totalvalid = data.count("|VALID|")
totalinvalid = data.count("INVALID")
print('Number of total shelfs :', totalshelfs)
print('Number of valid books :', totalvalid)
print('Number of invalid books :', totalinvalid)
txt file
HEADER|<br>
SHELF|2200019605568|<br>
BOOK|20200120000000|4810.1|20210402|VALID|<br>
SHELF|1591024987400|<br>
BOOK|20200215000000|29310.0|20210401|VALID|<br>
SHELF|1300001188124|<br>
BOOK|20200229000000|11519.0|20210401|VALID|<br>
SHELF|1300001188124|<br>
BOOK|20200329001234|115.0|20210331|INVALID|<br>
SHELF|1300001188124|<br>
BOOK|2020032904567|1144.0|20210401|INVALID|<br>
FOOTER|
What you need is the pandas library.
https://pandas.pydata.org/
You can read a csv file like this:
import pandas as pd
data = pd.read_csv('shelfs.txt', sep='|')
It returns a DataFrame object, which makes it easy to select or sort your data. It uses the first row as the header, and you can then select a specific column like a dictionary:
header = data['HEADER']
header is a Series object.
To select rows you can do:
shelfs = data.loc[data['HEADER']=='SHELF']
which selects only the rows where the header column is 'SHELF'.
I'm just not sure how pandas will handle the fact that you only have 1 header but 2 or 5 columns.
Maybe you should try to create one header per column in your csv, and add separators to make each row the same size first.
Edit (No external libraries or change in the txt file):
# Split by row (data is the file content read in the question's code)
data = data.split('<br>\n')
# Split by col
data = [d.split('|') for d in data]
# Fill empty cells
n_cols = max(len(d) for d in data)
for i in range(len(data)):
    while len(data[i]) < n_cols:
        data[i].append('')
# List VALID rows
valid_rows = [d for d in data if d[4] == 'VALID']
# Get sum, min and max (convert to float first; adding the raw strings would concatenate them)
valid_sum = [float(d[1]) + float(d[2]) + float(d[3]) for d in valid_rows]
valid_minimum = min(valid_sum)
valid_maximum = max(valid_sum)
It's maybe not exactly what you want to do but it solves a part of your problem. I didn't test the code.
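For reference, here is a self-contained sketch of the same idea using only the standard library. It assumes that "the numbers" means every field on a line that parses as a number, and that a trailing `<br>` may or may not be present in the file. The sample text is copied from the question, and the function name `valid_totals` is made up for illustration.

```python
SAMPLE = """HEADER|<br>
SHELF|2200019605568|<br>
BOOK|20200120000000|4810.1|20210402|VALID|<br>
SHELF|1591024987400|<br>
BOOK|20200215000000|29310.0|20210401|VALID|<br>
SHELF|1300001188124|<br>
BOOK|20200229000000|11519.0|20210401|VALID|<br>
SHELF|1300001188124|<br>
BOOK|20200329001234|115.0|20210331|INVALID|<br>
SHELF|1300001188124|<br>
BOOK|2020032904567|1144.0|20210401|INVALID|<br>
FOOTER|"""


def valid_totals(text, keyword="VALID"):
    """Return one total per line that has a field exactly equal to keyword."""
    totals = []
    for line in text.splitlines():
        fields = line.replace("<br>", "").strip().split("|")
        # Compare whole fields, so "INVALID" does not count as "VALID".
        if keyword not in fields:
            continue
        numbers = []
        for field in fields:
            try:
                numbers.append(float(field))
            except ValueError:
                pass  # non-numeric field such as "BOOK", "VALID" or ""
        totals.append(sum(numbers))
    return totals


totals = valid_totals(SAMPLE)
print("Highest valid total:", max(totals))
print("Lowest valid total:", min(totals))
```

If only one of the columns should be summed or compared, replace the inner loop with a lookup of that field by index.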

How to make tokenize not treat contractions and their counter parts as the same when comparing two text files?

I am currently working on a program that is supposed to compare two text files and make a list of the strings they have in common. My program receives the content of the two files as two strings, a and b (one file's content per variable). I then use the tokenize function in a for loop to break each string into sentences. These are stored in a set to avoid duplicate entries, so all duplicate lines within each variable are removed before the comparison. I then compare the two variables with each other and keep only the strings they have in common. I have a bug in this last part, where they are compared against each other: the program treats contractions and their full counterparts as the same when it should not. For example, it reads "Should not" and "Shouldn't" as the same and produces an incorrect answer. I want it to stop treating contractions and their counterparts as the same.
import nltk
def sentences(a, b): #the variables store the contents of the files in the form of strings
    a_placeholder = a
    set_a = set()
    a = []
    for punctuation_a in nltk.sent_tokenize(a_placeholder):
        if punctuation_a not in set_a:
            set_a.add(punctuation_a)
            a.append(punctuation_a)
    b_placeholder = b
    set_b = set()
    b = []
    for punctuation_b in nltk.sent_tokenize(b_placeholder):
        if punctuation_b not in set_b:
            set_b.add(punctuation_b)
            b.append(punctuation_b)
    a_new = a
    for punctuation in a_new:
        if punctuation not in set_b:
            set_a.remove(punctuation)
            a.remove(punctuation)
        else:
            pass
    return []
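A minimal sketch of the comparison described above, assuming exact string equality is the intended test. To stay runnable with the standard library only, it uses a naive regex splitter as a stand-in for `nltk.sent_tokenize` (the two are not equivalent). The names `split_sentences` and `common_sentences` are made up for illustration. It builds a new result list instead of removing items from a list while iterating over it, and it returns that list rather than `[]`.

```python
import re


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def common_sentences(a, b):
    """Return sentences of a that also occur in b, without duplicates, in a's order."""
    in_b = set(split_sentences(b))
    seen = set()
    result = []
    for sentence in split_sentences(a):
        # Exact comparison: "should not" and "shouldn't" are different strings.
        if sentence in in_b and sentence not in seen:
            seen.add(sentence)
            result.append(sentence)
    return result


print(common_sentences("You should not go. It is late. It is late.",
                       "You shouldn't go. It is late."))
```

With exact comparison like this, a sentence containing a contraction only matches a sentence that contains the same contraction.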

Python iterate over specific column in csv , and replacing values

First, sorry for my English ;)
I have a problem with a csv file. The file contains a lot of columns with a lot of different features. I want to iterate over the column host_location to get the entry of each row. Every string that contains "London" or "london" should be turned into a binary value: if the string contains "London" or "london" the entry should be 1, otherwise 0.
I'm familiar with Java, but Python is new to me.
What I know so far about this problem:
I can't change the csv file directly; I have to read it, change the values and write them to a new file.
My method so far:
listings = io.read_csv('../data/playground/neu.csv')
def Change_into_Binaryy():
    listings.loc[listings["host_location"] == ("London" or "london"), "host_location"] = 1
    listings.to_csv("../data/playground/neu1.csv", index=False)
The code is from another Stack Overflow question, and I'm really not familiar with Python so far. The problem is that I can only use the equality operator and not something like contains in Java.
As a result, only the entries that are exactly "London" or "london" are changed to 1. But there are also entries like "London, Uk" that I want to change.
In addition, I don't know how to change the remaining entries to 0, because I don't know how to combine .loc with something like an if/else construct.
I also tried another solution:
def Change_into_Binary():
    for x in listings['host_location']:
        if "London" or "london" in x:
            x = 1
        else:
            x = 0
    listings.to_csv("../data/playground/neu1.csv", index=False)
But this does not work either. In this case the entries are not changed.
Thanks for your answers.
from csv import DictReader, DictWriter
with open('infile.csv', 'r', newline='') as infile, open('outfile.csv', 'w', newline='') as outfile:
    reader = DictReader(infile)
    writer = DictWriter(outfile, fieldnames=reader.fieldnames)
    writer.writeheader()
    for row in reader:
        # Case-insensitive substring test, so "London, Uk" also matches
        if 'london' in row['host_location'].lower():
            row['host_location'] = 1
        else:
            row['host_location'] = 0
        writer.writerow(row)
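Both attempts in the question fail for reasons that are easy to miss coming from Java. The expression `"London" or "london"` evaluates to `"London"`, so `"London" or "london" in x` is always truthy and the `==` version only ever compares against `"London"`. Assigning to the loop variable `x` also only rebinds that name and never touches the column. The test itself can be written as a small function, sketched here with a made-up name `london_flag`:

```python
def london_flag(location):
    """Return 1 if location mentions London in any letter case, else 0."""
    return 1 if "london" in str(location).lower() else 0


locations = ["London", "london", "London, Uk", "Paris, France", ""]
flags = [london_flag(loc) for loc in locations]
print(flags)
```

With pandas, the equivalent should be `listings["host_location"].str.contains("london", case=False, na=False).astype(int)`, assuming `listings` is a DataFrame read with `pd.read_csv`.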

Is it possible to create a new column for each iteration in XlsxWriter

I want to write data into Excel columns using XlsxWriter. One 'set' of data gets written for each iteration. Each set should be written in a separate column. How do I do this?
I have tried playing around with the col value as follows:
At [1] I define i=0 outside the loop and later increment it by 1 and set col=i. When this is done output is blank. To me this is the most logical solution & I don't know why it won't work.
At [2] i is defined inside the loop. When this happens one column gets written.
At [3] I define col the standard way. This works as expected: One column gets written.
My code:
import xlsxwriter
txt_file = open('folder/txt_file','r')
lines = txt_file.readlines()
# [1]Define i outside the loop. When this is used output is blank.
i = 0
for line in lines:
    if condition_a is met:
        #parse text file to find a string. reg_name = string_1.
    elif condition_b:
        #parse text file for a second string. esto_name = string_2.
    elif condition_c:
        #parse text file for a group of strings.
        # use .split() to append these strings to a list.
        # reg_vars = list of strings.
        #[2] Define i inside the loop. When this is used one column gets written. Relevant for [1] & [2].
        i+=1 #Increment for each loop
        row=1
        col=i #Increments by one for each iteration, changing the column.
        #[3] #Define col =1. When this is used one column also gets written.
        col=1
        #Open Excel
        book= xlsxwriter.Workbook('folder/variable_list.xlsx')
        sheet = book.add_worksheet()
        #Write reg_name
        sheet.write(row, col, reg_name)
        row+=1
        #Write esto_name
        sheet.write(row, col, esto_name)
        row+=1
        #Write variables
        for variable in reg_vars:
            row+=1
            sheet.write(row, col, variable)
        book.close()
You can use the XlsxWriter write_column() method to write a list of data as a column.
However, in this particular case the issue seems to be that you are creating a new, duplicate, file via xlsxwriter.Workbook() each time you go through the condition_c part of the loop. Therefore the last value of col is used and the entire file is overwritten the next time through the loop.
You should probably move the creation of the xlsx file outside the loop. Probably to the same place you open() the text file.
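A rough sketch of that restructuring, assuming the parsed data has already been collected into a list of lists (one inner list per set). The workbook is created once before the loop and closed once after it, and each set goes into its own column via `write_column()`. The helper names `column_layout` and `write_sets` are made up for illustration, and the `xlsxwriter` import sits inside `write_sets` only so that the layout helper can be tried without the library installed.

```python
def column_layout(data_sets, first_col=0):
    """Pair each data set with the index of the column it should be written to."""
    return [(first_col + i, values) for i, values in enumerate(data_sets)]


def write_sets(path, data_sets):
    """Write every data set to its own column of a single worksheet."""
    import xlsxwriter  # third-party package: XlsxWriter

    book = xlsxwriter.Workbook(path)        # created once, before the loop
    sheet = book.add_worksheet()
    for col, values in column_layout(data_sets):
        sheet.write_column(0, col, values)  # one column per data set, starting at row 0
    book.close()                            # closed once, after the loop


example_sets = [["reg_1", "esto_1", "var_a", "var_b"], ["reg_2", "esto_2", "var_c"]]
print(column_layout(example_sets))
```

Calling `write_sets('variable_list.xlsx', example_sets)` would then produce one file with two columns.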

Skipping over array elements of certain types

I have a csv file that gets read into my code where arrays are generated out of each row of the file. I want to ignore all the array elements with letters in them and only worry about changing the elements containing numbers into floats. How can I change code like this:
myValues = []
data = open(text_file,"r")
for line in data.readlines()[1:]:
    myValues.append([float(f) for f in line.strip('\n').strip('\r').split(',')])
so that the last line knows to only try converting numbers into floats, and to skip the letters entirely?
Put another way, given this list,
list = ['2','z','y','3','4']
what command should be given so the code knows not to try converting letters into floats?
You could use try/except:
myVal = []
for i in list:
    try:
        myVal.append(float(i))
    except ValueError:
        # not a number, so skip it
        pass
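Wrapped in a function, the same try/except idea can replace the list comprehension in the question's loop. This is a sketch; the name `floats_only` is made up for illustration.

```python
def floats_only(items):
    """Convert the items that parse as floats and silently skip the rest."""
    result = []
    for item in items:
        try:
            result.append(float(item))
        except ValueError:
            continue  # e.g. 'z' or 'y'
    return result


line = "2,z,y,3,4\r\n"
print(floats_only(line.strip().split(",")))
```

In the question's loop this would become `myValues.append(floats_only(line.strip().split(',')))`. Note that `float()` also accepts strings such as 'nan' and 'inf', so those would not be skipped.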

Resources