how to search a text file in python 3 - python-3.x

I have a text file that has lists in it. How would I search for an individual list? I have tried using loops to find it, but every time I get an error since I don't know what to search for.
I tried using an if statement to find it, but it returns -1.
Thanks for the help.

I was doing research on this last night. You can use pandas for this. See here: Load data from txt with pandas. One of the answers talks about list in text files.
You can use:
data = pd.read_csv('output_list.txt', sep=" ", header=None)
data.columns = ["Name", "b", "c", "etc."]
Add sep=" " to your code, leaving a blank space between the quotes, so that pandas can detect the spaces between values and sort them into columns. data.columns is for naming your columns.
With a JSON or XML format, text files become more searchable. In my research I've decided to go with an XML approach. Here is the link to a blog that explains how to use Python with XML: http://www.austintaylor.io/lxml/python/pandas/xml/dataframe/2016/07/08/convert-xml-to-pandas-dataframe.
If you want to search the data frame try:
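import pandas as pd
# raw string, so that \t in the Windows path is not read as a tab character
txt_file = r'C:\path\to\your\txtfile.txt'
df = pd.read_table(txt_file, sep=",")
# keep only the rows whose Name column equals 'bob'
row = df.loc[df['Name'] == 'bob']
print(row)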
Depending on how your text file is formatted, this will not work for every text file. A pandas dataframe gives the data a CSV-like structure, which makes the process repeatable and the results testable. Again, I recommend converting to a JSON or XML format before using pandas data frames in your solution. You can then get a consistent result that is testable too!
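If pandas is more than you need, here is a minimal standard-library sketch of the search itself. It assumes each line of the file holds one Python list literal such as [1, 2, 3], which the question does not confirm, and find_list is a hypothetical helper name. The -1 mentioned in the question is what str.find returns when the substring is not present, so comparing parsed values may be more reliable than comparing raw text.

```python
import ast


def find_list(path, target):
    """Return the 1-based line numbers whose content parses to `target`.

    Assumption: every list sits on its own line as a Python literal.
    """
    matches = []
    with open(path, encoding="utf-8") as fh:
        for line_no, line in enumerate(fh, start=1):
            text = line.strip()
            if not text:
                continue  # skip blank lines
            try:
                value = ast.literal_eval(text)
            except (ValueError, SyntaxError):
                continue  # the line is not a Python literal, so ignore it
            if value == target:
                matches.append(line_no)
    return matches
```

Calling find_list('output_list.txt', [1, 2, 3]) would return an empty list when nothing matches, instead of raising an error.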

Related

Can you read a CSV file as one column?

I know this sounds silly, but is it possible to read a CSV file containing multiple columns and combine all the data into one column? Let's say I have a CSV file with 6 columns and they have different delimiters. Is it possible to read these files, but spit out the first 100 rows into one column, without specifying a delimiter? My understanding is that this isn't possible if using pandas.
I don't know if this helps, but to add context to my question, I'm trying to use Treeview from Tkinter to display the first 100 rows of a CSV file. The Treeview window should display this data as 1 column if a delimiter isn't specified. Otherwise, it will automatically split the data based on a delimiter from the user input.
This is the data I have:
This should be the result:
Pandas isn't the only way to read a CSV file. There is also the built-in csv module in the Python standard library, as well as the basic built-in function open, which will work just as well. Both of these methods can generate single rows of data like your question indicates.
Using the open function:
filepath = "/path/to/file.csv"
with open(filepath, "rt", encoding="utf-8") as fd:
    header = next(fd)
    for row in fd:
        # .... do something with row data
        # row will be a string of all the data for a single row.
        # example: "Information,44775.4541667,MicrosoftWindowsSecurity,16384..."
        # then you can break at any time you want to stop reading.
        print(row)
or using the csv module:
import csv
with open("/path/to/file.csv", "rt", encoding="utf8", newline="") as fd:
    reader = csv.reader(fd, delimiter=',')
    header = next(reader)
    for row in reader:
        # this time the row will be a list split by the delimiter which
        # by default is a comma but you can change it in the call to the reader
        print(row)
You can use
with open('file.csv') as f: data = f.readlines()
to read the file line by line.
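Since readlines() loads the whole file, a variation that stops after the first 100 lines may suit the question better. This is only a sketch: first_rows_as_one_column is a hypothetical helper name, and it assumes the file is UTF-8 text with no quoted fields that span several lines.

```python
from itertools import islice


def first_rows_as_one_column(path, limit=100):
    """Read at most `limit` lines and return each as a one-field row."""
    with open(path, encoding="utf-8", newline="") as fh:
        # islice stops reading after `limit` lines, so the rest of the
        # file is never loaded; the line ending is stripped from each row
        return [(line.rstrip("\r\n"),) for line in islice(fh, limit)]
```

Each row is a one-element tuple holding the unsplit line, which is the shape a single-column table widget would typically expect.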
As other answers have explained, you can use various ways to read first n-lines of text from a file. But if you insist on using pandas then there is a trick you can use.
Find a character which will never appear in your text and use it as dummy delimiter to read_csv(), so that all text will be read as one column. Use nrows parameter to control number of lines to read:
pd.read_csv("myfile.csv", sep="~", nrows=100)

Get word definition/s from google translator using Python

Please, I am making my own dictionary and can't figure out how to pull translation definitions from Google Translate. My idea is that Python will open my Excel file, where every cell in column 1 holds a new word. Python will take each word, translate it from English to Slovak using Google Translate, and take not just the translated word but also its definition(s) (if there is more than one definition, take them all) and the group of each definition (noun, adverb, verb, ...). It will then add this data to the Excel table, either in a new cell next to the original word or, if there are several definitions, in a new row for every definition.
I'm new to this so please excuse me.
To satisfy your requirements, one way is to do the following in your script:
You can use pandas.read_excel to read your Excel file and do some data manipulation to get all the values in column 1.
Once you have the values to translate, you can use something like googletrans, which uses Google Translate on the back end, or the paid Google Translation API to handle your translations. Based on your requirements, I suggest using the Google Translation API since it is capable of returning all possible definitions.
Once you have your translations, it is up to you to transform the data so you can add it as a new column in your original Excel file. You can use pandas.ExcelWriter for this.
I made this simple script that reads a CSV file (I don't have Excel installed on my machine), translates everything under the text column, and puts the results in the translated column. It's up to you whether you process the data differently.
NOTE for the script below:
I used the Google Translation API, which is the paid service
Use pd.read_excel() to read Excel files
Adjust the column number based on your input file
sample_data.csv:
text,dummy_field
run,dummy1
how are you,dummy2
jump,dummy3
Sample script:
import pandas as pd
from google.cloud import translate_v2 as translate

def translate_text(text):
    translate_client = translate.Client()
    # target language code: 'tl' is Tagalog; use 'sk' for Slovak
    target = 'tl'
    result = translate_client.translate(text, target_language=target)
    return result["translatedText"]

def process_data(input_file):
    #df = pd.read_excel('test.xlsx', engine='openpyxl')
    df = pd.read_csv(input_file)
    df['translated'] = df['text'].apply(translate_text)
    # move column 'translated' to second column
    # this position will depend on your actual data
    second_col = df.pop('translated')
    df.insert(1, 'translated', second_col)
    print(df)
    df.to_csv('./updated_data.csv', index=False)
    df.to_excel('./updated_data.xlsx', index=False)

process_data('sample_data.csv')
Output:
Dataframe
Generated csv file:
Generated excel file:
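The script above writes one translation per word. For the "one row per definition" layout the question asks for, the reshaping step could look roughly like the sketch below. The definitions mapping is purely hypothetical (it is not the response format of any translation service) and expand_definitions is an illustrative helper name; how you obtain the definitions is a separate problem.

```python
def expand_definitions(words, definitions):
    """Turn a word -> [(part_of_speech, meaning), ...] mapping into flat rows.

    `definitions` is a hypothetical structure used only for illustration.
    """
    rows = []
    for word in words:
        entries = definitions.get(word, [])
        if not entries:
            # keep the word visible even when nothing was found for it
            rows.append((word, "", ""))
            continue
        for part_of_speech, meaning in entries:
            rows.append((word, part_of_speech, meaning))
    return rows


# placeholder data, not real dictionary output
sample = {"run": [("verb", "<meaning 1>"), ("noun", "<meaning 2>")]}
rows = expand_definitions(["run", "jump"], sample)
```

The resulting list of tuples can be handed to pd.DataFrame(rows, columns=[...]) and written out with to_excel as in the script above.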

comparing multiple tab delimited csv files in python

As a start, I want to compare the first two columns of two .csv files then write what is common in these files to an output file, say common.csv, then also write the differences in each file to different output files, say f1.csv and f4.csv.
So far I have tried using set(), difflib, and also taking the two files, creating lists from them, and comparing the first two columns in each file. This gave me the output for what is common, but not for the differences in each file when compared to each other. I have tried most of the posted solutions that seemed similar to my problem, but I am still stuck. Can someone please assist?
These are the headers in my files. I only want to compare the first two columns, but write out the entire line to the output file.
fieldnames = (["Chromosome" ,"GenomicPosition", "ReferenceBase",
"AlternateBase", "GeneName", "GeneID",
"TrancriptID", "Varianteffect-Variantimpact",
"Biotype", "TranscriptBiotype" , "Referencebase",
"Alternatebase", "Depth coverage"])
One solution is to use pandas, which is very powerful.
To convert csv <-> pandas dataframes:
import pandas as pd
df = pd.read_csv('csv_file.csv') # csv -> pandas
df.to_csv('csv_file.csv', index=False) # pandas -> csv
To compare pandas dataframes on columns, this post should point you in the right direction: https://stackoverflow.com/a/47107164/2667536
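As a standard-library alternative (this is not the approach from the linked post), the comparison can be sketched with the csv module and sets. It assumes both files are tab-delimited with one header row, as the question title suggests, and split_by_key and write_rows are hypothetical helper names.

```python
import csv


def split_by_key(path_a, path_b, delimiter="\t", key_cols=2):
    """Compare two files on their first `key_cols` columns.

    Returns (header, common, only_a, only_b); every row is kept whole.
    """
    def load(path):
        with open(path, newline="", encoding="utf-8") as fh:
            reader = csv.reader(fh, delimiter=delimiter)
            header = next(reader)
            return header, [row for row in reader if row]

    header_a, rows_a = load(path_a)
    _, rows_b = load(path_b)
    keys_a = {tuple(r[:key_cols]) for r in rows_a}
    keys_b = {tuple(r[:key_cols]) for r in rows_b}
    # rows in `common` are taken from the first file
    common = [r for r in rows_a if tuple(r[:key_cols]) in keys_b]
    only_a = [r for r in rows_a if tuple(r[:key_cols]) not in keys_b]
    only_b = [r for r in rows_b if tuple(r[:key_cols]) not in keys_a]
    return header_a, common, only_a, only_b


def write_rows(path, header, rows, delimiter="\t"):
    """Write a header and full rows using the same delimiter."""
    with open(path, "w", newline="", encoding="utf-8") as fh:
        writer = csv.writer(fh, delimiter=delimiter)
        writer.writerow(header)
        writer.writerows(rows)
```

The three returned row lists would then go to common.csv, f1.csv and f4.csv through write_rows.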

Problem when importing table from pdf to python using tabula

When importing data from pdf using tabula with Python, in some cases, I obtain two or more columns merged in one. It does not happen with all the files obtained from the same pdf.
In this case, this is the code used to read the pdf:
from tabula import wrapper
tables = wrapper.read_pdf("933884 cco Saupa 1.pdf",multiple_tables=True,pages='all')
i = 1
for table in tables:
    table.to_excel('output' + str(i) + '.xlsx', index=False)
    i = i + 1
For example, when I print the first item of the dataframe obtained from one of these excel files, named "output_pd":
print (output_pd[0][1])
I obtain:
76) 858000015903708 77) 858000013641969 78)
The five numbers are in a single column, so I cannot treat them individually.
Is it possible to improve the data handling in these cases?
You could try manually editing the data in excel. If you use text to columns under the data tab in excel it allows you to split one column into multiple columns without too much work, but you would need to do it for every excel file which could be a pain.
By iterating over each item of each column of each dataframe in the list returned by tabula's wrapper.read_pdf(file) (tables, in this case), it is possible to obtain clean data.
In this case:
prueba = []
i = 0
for table in tables:
    for columna in table.columns:
        # split the printed column on spaces and keep the 858... values
        for item in str(table[columna]).split(" "):
            if "858" in str(item):
                prueba.append(item[0:15])
print(prueba[0:5])
results in:
['858000019596025', '858000015903707', '858000013641975', '858000000610864', '858000013428853']
But tabula.wrapper.read_pdf does not read the whole initial PDF: 2 values are left on the last page. So it is still necessary to make a small manual edit.
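A regular expression is a possible alternative to the substring test, since "858" in item also matches values that merely contain those digits somewhere. The sketch below assumes the identifiers are always exactly 15 digits starting with 858, which is inferred from the sample output and not confirmed by the PDF; extract_ids is a hypothetical helper name.

```python
import re

# assumption: an identifier is exactly 15 digits and starts with 858
ID_PATTERN = re.compile(r"\b858\d{12}\b")


def extract_ids(cell):
    """Return every 15-digit 858... identifier found in one merged cell."""
    return ID_PATTERN.findall(str(cell))


merged = "76) 858000015903708 77) 858000013641969 78)"
ids = extract_ids(merged)
```

Applying extract_ids to each cell of each dataframe gives one list per cell, so several values merged into one column are separated again.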

Writing pandas data frame to csv but no space between columns- sep'\t' argument is ignored in Python 3

I have a problem and found many related questions asked here and read them all, but still can't solve it. So far I haven't gotten any answer.
I have two files; one is .csv and the other is .xlsx. They have a different number of rows and columns. I would like to merge these two according to filenames. Very simplified, the two files look as follows;
The csv file;
the excel file;
First I converted them to pandas data frames;
import pandas as pd
import csv,xlrd
df1 = pd.read_csv('mycsv.csv')
df2 = pd.read_excel('myexcel.xlsx', sheet_name=0)
To merge the two files on the same column, I remove the white space in the column names in df2 using the first line below, then merge them and write the merged data frame to a csv file.
df2.columns=df2.columns.str.replace(' ', '')
df=pd.merge(df1, df2, on="filename")
df.to_csv('myfolder \\merged_file.csv', sep="\t ")
When I check my folder, I see merged_file.csv exists, but when I open it there is no space between columns and values. I want to see a nice, normal csv or Excel look, like my example files above. Just to make sure I tried everything, I also converted the Excel file to a csv file and then merged the two csv files, but the merged data is still written without spaces. Again, the above files are very simplified, but my real merged data look like this;
Finally figured it out. I am putting the answer here in case anyone else makes the same mistake as me. Just remove the sep="\t" and use the line below instead;
df.to_csv('myfolder \\merged_file.csv')
I just realized the two csv files were comma separated, and using a tab delimiter for the merge didn't work.
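One possible explanation for the "no space between columns" symptom is a delimiter mismatch: a file written with tabs but opened by something that splits on commas shows every row as a single field. The standard-library sketch below illustrates that effect only; the score column and the file names in the data are made up, and it does not reproduce the pandas call from the question.

```python
import csv
import io

# illustrative data; "score" and the .wav names are hypothetical
rows = [["filename", "score"], ["a.wav", "1"], ["b.wav", "2"]]


def write_delimited(data, delimiter):
    """Serialize rows to text using the given one-character delimiter."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter=delimiter, lineterminator="\n")
    writer.writerows(data)
    return buffer.getvalue()


def read_delimited(text, delimiter):
    """Parse text back into rows using the given delimiter."""
    return list(csv.reader(io.StringIO(text), delimiter=delimiter))


tab_text = write_delimited(rows, "\t")
as_tab = read_delimited(tab_text, "\t")   # columns split as written
as_comma = read_delimited(tab_text, ",")  # each row stays one field
```

Reading with the same delimiter that was used for writing restores the columns, which is consistent with the fix above of writing a plain comma-separated file.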
