How to read the starting N words from each row in python3 - python-3.x

I am reading an Excel file which has free text in a column. After reading that file with pandas, I want to restrict the text column to just the first N words of each row. I tried everything but was not able to make it work.
data["text"] = I am going to school and I bought something from market.
But I just want to keep the starting 5 words, so that it looks like below.
data["text"] = I am going to school.
I want this same operation to be done for each row of the data["text"] column.
Your help will be highly appreciated.

def first_k(s: str, k=5) -> str:
    s = str(s)  # just in case something like NaN tries to sneak in there
    first_words = s.split()[:k]
    return ' '.join(first_words)
Then, apply the function:
data['text'] = data['text'].apply(first_k)

data["text"] = [' '.join(s.split(' ')[:5]) for s in data["text"].values]

Related

How to split compound words in pandas?

I have a document that consists of many compound (or sometimes combined) words, such as:
document.csv
index text
0 my first java code was helloworld
1 my cardoor is totally broken
2 I will buy a screwdriver to fix my bike
As seen above, some words are combined or compound, and I am using the compound word splitter from here to fix this issue. However, I have trouble applying it to each row of my document (like a pandas series) and converting the document into a clean form:
cleanDocument.csv
index text
0 my first java code was hello world
1 my car door is totally broken
2 I will buy a screw driver to fix my bike
(I am aware that a word such as screwdriver should stay together, but my goal is cleaning the document.) If you have a better idea for splitting only combined words, please let me know.
The splitter code may work as:
import pandas as pd
import splitter  ## This uses an enchant dict (pip install enchant required)
data = pd.read_csv('document.csv')
then it would be used as something like:
splitter.split(data) ## ???
I already looked into something like this but it does not work in my case. Thanks.
You can use apply with axis=1. Can you try the following:
data.apply(lambda x: [splitter.split(j) for j in x['text'].split()], axis=1)
I do not have splitter installed on my system. By looking at the link you have provided, I have the following code. Can you try:
def handle_list(m):
    ret_lst = []
    L = m['text'].split()
    for wrd in L:
        g = splitter.split(wrd)
        if g:
            ret_lst.extend(g)
        else:
            ret_lst.append(wrd)
    return ret_lst
data.apply(handle_list, axis = 1)
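If the goal is a cleaned sentence rather than a list of tokens, a small variation (a sketch; it assumes, as in the answer above, that splitter.split returns a list of parts and something falsy for words it cannot split) joins the pieces back together:
def clean_text(row):
    # Split each word with the compound-word splitter; keep the original
    # word whenever the splitter cannot break it apart.
    words = []
    for wrd in row['text'].split():
        parts = splitter.split(wrd)
        words.extend(parts if parts else [wrd])
    return ' '.join(words)
data['text'] = data.apply(clean_text, axis=1)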

How to save tuple output from a for loop to a DataFrame in Python

I have some data, 33k rows x 57 columns.
Some columns contain data which I want to translate with a dictionary.
I have done the translation, but now I want to write the translated data back to my data set.
I have a problem with saving the tuple output from the for loop.
I am using tuples to build a good translation. .join and .append are not working in my case. I tried many approaches but without any success.
Looking for any advice.
data = pd.read_csv(filepath, engine="python", sep=";", keep_default_na=False)
for index, row in data.iterrows():
    row["translated"] = tuple(slownik.get(znak) for znak in row["1st_service"])
I just want print(data["1st_service"]) to show the translated data, not the data from before the for loop.
First of all, if your csv doesn't already have a 'translated' column, you'll have to add it:
import numpy as np
data['translated'] = np.nan
The problem is that the row object you're trying to write to is only a view of the dataframe, not the dataframe itself. Plus, you're missing square brackets for your list comprehension, if I'm understanding what you're doing. So change your last line to:
data.loc[index, "translated"] = tuple([slownik.get(znak) for znak in row["1st_service"]])
and you'll get a tuple written into that one cell.
In future, posting the exact error message you're getting is very helpful!
I have managed it; below is the working code:
data = pd.read_csv(filepath, engine="python", sep=";", keep_default_na=False)
data.columns = []
slownik = dict([ ])
trans = ' '
for index, row in data.iterrows():
    trans += str(tuple([slownik.get(znak) for znak in row["1st_service"]]))
data['1st_service'] = trans.split(')(')
data.to_csv("out.csv", index=False)
Can you tell me if it is well done? Maybe there is a faster way to do it? I am doing it for 12 columns in one for loop, as shown above.
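A faster approach (a sketch; it assumes slownik is the translation dictionary and filepath the csv path from the question, and that each of the 12 columns is translated the same way) maps each cell directly with apply instead of concatenating and re-splitting strings:
import pandas as pd
data = pd.read_csv(filepath, engine="python", sep=";", keep_default_na=False)
def translate(cell):
    # Look up every character of the cell in the slownik dictionary.
    return tuple(slownik.get(znak) for znak in cell)
for col in ["1st_service"]:  # extend this list with the other column names
    data[col] = data[col].apply(translate)
data.to_csv("out.csv", index=False)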

Wrong format csv output in python

I have the following code that is working 99% of the way - it has taken me various attempts to get it right:
w = csv.writer(filename, lineterminator="\n")
sC = []
for i in sOut:
    #print("save", i[1:])
    sC.append(i[1:])  # slice away first part
sP = self.ids(sC)
w.writerow(sP)
filename.close()
print("You save ", filename)  # To show on CLI
def ids(self, numbering):
    tally = 1
    for i in range(len(numbering)):
        id = str(tally)
        numbering[i].insert(0, id)
        tally = tally + 1
    return numbering
The output it should return inside a CSV file should look like this, i.e. in separate columns:
1 -4.885276794 55.72986221
2 -4.885276794 55.72958374
3 -4.883611202 55.72958374
Instead it returns everything in one row, with square brackets, commas and apostrophes, all of which I do not want:
['1', -4.88527679443359, 55.7298622131348] ['2', -4.7475008964538, 55.9473609924319] ['3', -4.79416608810425, 56.02791595459]
I know I am making some basic mistake somewhere, but I just don't know where. All help will be greatly appreciated.
Thanks, Jemma
As per juanpa, it should be w.writerows(sP) instead of w.writerow(sP) - all works!
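For illustration, here is a minimal sketch (with made-up data) of the difference: writerow writes a single row, while writerows writes each inner list as its own row:
import csv
rows = [['1', -4.885276794, 55.72986221],
        ['2', -4.885276794, 55.72958374],
        ['3', -4.883611202, 55.72958374]]
with open('out.csv', 'w', newline='') as f:
    w = csv.writer(f, lineterminator="\n")
    # w.writerow(rows) would stuff the whole list of lists into a single CSV row,
    # which is what produces the bracketed output shown above.
    w.writerows(rows)  # one CSV row per inner list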

Python adding lists

I am importing data from a file, which is working correctly. I have appended the data from this file into 3 different lists: name, mark, and mark2. I don't understand how, or whether, I can make a new list called total_marks and append the calculation mark + mark2 into total_marks. I tried looking around for help on this but couldn't find much relating to it. The plan is to add the two lists together and work out a percentage, where the total mark is 150.
To add the two lists item by item:
combined = []
for m1, m2 in zip(mark, mark2):
    combined.append(m1 + m2)
The zip function returns a pair of items, one from each list, for every position in the lists:
https://docs.python.org/3/library/functions.html#zip
Then you can perform the final operation this way:
final = []
for m in combined:
    final.append(m / 150 * 100)
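Equivalently, both steps can be written as list comprehensions (a sketch using the same mark and mark2 lists):
combined = [m1 + m2 for m1, m2 in zip(mark, mark2)]
final = [m / 150 * 100 for m in combined]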
As I said in my comment, I highly recommend that once you've gotten past learning the basics that you then take the time to learn two libraries: pandas and xlwings. These will greatly help your ability to interact between python and excel. An operation like you have here becomes much simpler once you learn pandas dataframes.
Here is a better way, using pandas.
import pandas
df = pandas.read_csv('Classmarks.csv', index_col = 'student_name', names = ('student_name', 'mark1', 'mark2'), header = None)
df['combined'] = df['mark1'] + df['mark2']
df['final'] = df['combined'] / 150 * 100
print(df)
Don't have to do any loops using pandas. And you can then write it back to a csv file:
df.to_csv('Classmarksout.csv')

Generators for processing large result sets

I am retrieving information from a sqlite DB that gives me back around 20 million rows that I need to process. This information is then transformed into a dict of lists which I need to use. I am trying to use generators wherever possible.
Can someone please take a look at this code and suggest optimizations? I am either getting a "Killed" message or it takes a really long time to run. The SQL result set part is working fine. I tested the generator code in the Python interpreter and it doesn't have any problems. I am guessing the problem is with the dict generation.
EDIT/UPDATE FOR CLARITY:
I have 20 million rows in my result set from my sqlite DB. Each row is of the form:
(2786972, 486255.0, 4125992.0, 'AACAGA', '2005')
I now need to create a dict that is keyed with the fourth element of the row, 'AACAGA'. The value that the dict will hold is the third element, but it has to hold the values for all the occurrences in the result set. So, in our case here, 'AACAGA' will hold a list containing multiple values from the SQL result set. The problem here is to find tandem repeats in a genome sequence. A tandem repeat is a genome read ('AACAGA') that is repeated at least three times in succession. For me to calculate this, I need all the values in the third index as a list keyed by the genome read, in our case 'AACAGA'. Once I have the list, I can subtract successive values in the list to see if there are three consecutive matches to the length of the read. This is what I aim to accomplish with the dictionary and lists as values.
#!/usr/bin/python3.3
import sqlite3 as sql

sequence_dict = {}
tandem_repeat = {}

def dict_generator(large_dict):
    dkeys = large_dict.keys()
    for k in dkeys:
        yield (k, large_dict[k])

def create_result_generator():
    conn = sql.connect('sequences_mt_test.sqlite', timeout=20)
    c = conn.cursor()
    try:
        conn.row_factory = sql.Row
        sql_string = "select * from sequence_info where kmer_length > 2"
        c.execute(sql_string)
    except sql.Error as error:
        print("Error retrieving information from the database : ", error.args[0])
    result_set = c.fetchall()
    if result_set:
        conn.close()
    return (row for row in result_set)

def find_longest_tandem_repeat():
    sortList = []
    for entry in create_result_generator():
        sequence_dict.setdefault(entry[3], []).append(entry[2])
    for key, value in dict_generator(sequence_dict):
        sortList = sorted(value)
        for i in range(0, len(sortList) - 3):
            if ((sortList[i+1] - sortList[i]) == (sortList[i+2] - sortList[i+1])
                    == (sortList[i+3] - sortList[i+2]) == len(key)):
                tandem_repeat[key] = True
                break
    print(max(k for k, v in tandem_repeat.items() if v))

if __name__ == "__main__":
    find_longest_tandem_repeat()
I got some help with this on Code Review, as #hivert suggested. Thanks. This is much better solved in SQL than in plain code. I was new to SQL and hence could not write complex queries; someone helped me out with that.
SELECT *
FROM sequence_info AS middle
JOIN sequence_info AS preceding
ON preceding.sequence_info = middle.sequence_info
AND preceding.sequence_offset = middle.sequence_offset -
length(middle.sequence_info)
JOIN sequence_info AS following
ON following.sequence_info = middle.sequence_info
AND following.sequence_offset = middle.sequence_offset +
length(middle.sequence_info)
WHERE middle.kmer_length > 2
ORDER BY length(middle.sequence_info) DESC, middle.sequence_info,
middle.sequence_offset;
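To run the final query from Python (a sketch; it assumes the same sequences_mt_test.sqlite database used in the code above, and that the SQL text above is stored in a variable named query):
import sqlite3 as sql
# `query` is assumed to hold the SQL shown above.
with sql.connect('sequences_mt_test.sqlite', timeout=20) as conn:
    cursor = conn.execute(query)
    first_row = cursor.fetchone()  # results are ordered by repeat length, descending
    print(first_row)               # this row involves the longest tandem repeat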
Hope this helps someone with a similar problem. Here is a link to the thread on codereview.stackexchange.com.
