How to select a specific part in a .csv file on multiple lines? (Python3)

I'm making this example up for the sake of explaining the problem.
If I have a .csv file as follows:
alfa, bravo, Charlie, 1.31
Dragonball, manga, anime, 3.11
delta, Omega, cookie, 3.13
Dragonball, stan, lee, 1.13
How can I pick up the fourth part of each line that has "Dragonball" as the first part, given that the list goes on further and I do not know which lines have "Dragonball" as the first part?
I have tried:
result = []
# assuming file is an open file object, e.g. file = open("data.csv")
for line in file:
    line = line.rstrip()
    part = line.split(",")
    if part[0] == "Dragonball":
        result.append(float(part[3].strip()))
Expected output:
result = [3.11, 1.13]

You can do it easily using pandas:
import pandas as pd

# header=None because the file has no header row
df = pd.read_csv("path to your csv file", header=None)
print(list(df[df[0] == 'Dragonball'][3]))
Output:
[3.11, 1.13]
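If you would rather avoid the pandas dependency, a plain csv.reader version along the lines of your own attempt also works (a sketch; the filename data.csv is an assumption):
import csv

result = []
with open("data.csv", newline="") as f:
    # skipinitialspace drops the blank after each comma
    for row in csv.reader(f, skipinitialspace=True):
        if row and row[0] == "Dragonball":
            result.append(float(row[3]))
print(result)  # [3.11, 1.13]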

Related

Instead of printing to console create a dataframe for output

I am currently comparing the text of one file to that of another file.
The method: for each row in the source text file, check each row in the compare text file.
If the word is present in the compare file, write the word with 'present' next to it.
If the word is not present, write the word with 'not_present' next to it.
So far I can do this fine by printing to the console output, as shown below:
import sys

filein = 'source.txt'
compare = 'compare.txt'
source = 'source.txt'

# change to lower case
with open(filein, 'r+') as fopen:
    string = ""
    for line in fopen.readlines():
        string = string + line.lower()
with open(filein, 'w') as fopen:
    fopen.write(string)

# search and list
with open(compare) as f:
    searcher = f.read()
if not searcher:
    sys.exit("Could not read data :-(")

# search and output the results
with open(source) as f:
    for item in (line.strip() for line in f):
        if item in searcher:
            print(item, ',present')
        else:
            print(item, ',not_present')
the output looks like this:
dog ,present
cat ,present
mouse ,present
horse ,not_present
elephant ,present
pig ,present
What I would like is to put this into a pandas dataframe, preferably with 2 columns, one for the word and the second for its state. I can't seem to get my head around doing this.
I am making several assumptions here to include:
compare.txt is a text file consisting of a list of single words, one word per line.
source.txt is a free-flowing text file, which includes multiple words per line, each separated by a space.
When comparing, a compare word is found in source if and only if no punctuation marks (i.e. " ' , . ? etc.) are appended to the word in source.
The output dataframe will only contain the words found in compare.txt.
The final output is a printed version of the pandas dataframe.
With these assumptions:
import pandas as pd
from collections import defaultdict

compare = 'compare.txt'
source = 'source.txt'
rslt = defaultdict(list)

def getCompareTxt(fid: str) -> list:
    clist = []
    with open(fid, 'r') as cmpFile:
        for line in cmpFile.readlines():
            clist.append(line.lower().strip('\n'))
    return clist

cmpList = getCompareTxt(compare)
if cmpList:
    with open(source, 'r') as fsrc:
        items = []
        for item in (line.strip().split(' ') for line in fsrc):
            items.extend(item)
    print(items)
    for cmpItm in cmpList:
        rslt['Name'].append(cmpItm)
        if cmpItm in items:
            rslt['State'].append('Present')
        else:
            rslt['State'].append('Not Present')
    df = pd.DataFrame(rslt, index=range(len(cmpList)))
    print(df)
else:
    print('No compare data present')
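One small design note: membership tests against a list scan every element, so if source.txt is large you may want to build a set from items once and test against that instead (an optional tweak):
# set lookup is O(1) on average, list lookup is O(n)
item_set = set(items)
# then use: if cmpItm in item_set: ...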

Read CSV using pandas

I'm trying to read data from https://download.bls.gov/pub/time.series/bp/bp.measure using pandas, like this:
import pandas as pd
url = 'https://download.bls.gov/pub/time.series/bp/bp.measure'
df = pd.read_csv(url, sep='\t')
However, I just need to get the dataset with the two columns measure_code and measure_text. As this dataset has the title 'BP measure', I also tried to read it like:
url = 'https://download.bls.gov/pub/time.series/bp/bp.measure'
df = pd.read_csv(url, sep='\t', skiprows=1)
But in this case it returns a dataset with just one column, and I'm not able to split it:
>>> df.columns
Index([' measure_code measure_text'], dtype='object')
Any suggestion/idea on a better approach to get this dataset?
It's definitely possible, but the format has a few quirks:
1. As you noted, the column headers start on line 2, so you need skiprows=1.
2. The file is space-separated, not tab-separated.
3. Column values are continued across multiple lines.
Issues 1 and 2 can be fixed using skiprows and sep. Problem 3 is harder, and requires you to preprocess the file a little. For that reason, I used a slightly more flexible way of fetching the file, using the requests library. Once I have the file, I use regular expressions to fix problem 3, and give the file back to pandas.
Here's the code:
import requests
import re
import io
import pandas as pd

url = 'https://download.bls.gov/pub/time.series/bp/bp.measure'

# Get the URL, convert the document from DOS to Unix linebreaks
measure_codes = requests.get(url) \
    .text \
    .replace("\r\n", "\n")

# If there's a linebreak followed by at least 7 spaces, combine it with the
# previous line
measure_codes = re.sub("\n {7,}", " ", measure_codes)

# Convert the string to a file-like object
measure_codes = io.BytesIO(measure_codes.encode('utf-8'))

# Read in the file, interpreting 4 spaces or more as a delimiter.
# Using a regex like this requires using the slower Python engine.
# Use skiprows=1 to skip the header.
# Use dtype="str" to avoid converting measure_code to integer.
df = pd.read_csv(measure_codes, engine="python", sep=" {4,}", skiprows=1, dtype="str")
print(df)

Python 3.9: For loop is not producing output files even though no errors are displayed

Hi everyone, I am fairly new to using Python for data analysis, so apologies for silly questions:
IDE : PyCharm
What I have: A massive .xyz file (with 4 columns) which is a combination of several datasets. Each dataset can be identified by the third column of the file, which goes from 10,000 to -10,000 (with 0 in between) in steps of 100 and then repeats, so every 201 rows is one dataset.
What I want to do: Split the massive file into its individual datasets (201 rows each) and save each file under a different name.
What I have done so far :
# Import packages
import os
import pandas as pd
import numpy as np  # For next steps
import math  # For next steps

# Check and change directory
path = 'C:/Clayton/lines/profiles_aufmod'
os.chdir(path)
print(os.getcwd())  # Correct path is printed

# split the xyz file into different files for each profile
main_xyz = 'bathy_SPO_1984_50x50_profile.xyz'
number_lines = sum(1 for row in (open(main_xyz)))
print(number_lines)  # 10854 is the output

rowsize = 201
for i in range(number_lines, rowsize):
    profile_raw_df = pd.read_csv(main_xyz, delimiter=',', header=None,
                                 nrows=rowsize, skiprows=i)
    out_xyz = 'Profile' + str(i) + '.xyz'
    profile_raw_df.to_csv(out_xyz, index=False, header=False, mode='a')
Problems I am facing :
The for loop was at first producing output files, as seen in the attached image ('Proof of output'), but now it does not produce any outputs, nor does it rewrite the previous files. The other mystery is that I am not getting an error either (see 'Code executed without error').
What I tried to fix the issue :
I updated all the packages and restarted Pycharm
I ran each line of code one by one and everything works until the for loop
While counting the number of rows in
number_lines = sum(1 for row in (open(main_xyz)))
you have exhausted the iterator that loops over the lines of the file, and you never close the file. This should not, however, prevent pandas from reading the same file.
A better idiom would be
with open(main_xyz) as fh:
    number_lines = sum(1 for row in fh)
Your for loop as it stands does not do what you probably want. I guess you want:
for i in range(0, number_lines, rowsize):
so that rowsize is the step size instead of the end value of the range.
If you want to number the output files by dataset, keep a count of the dataset, like this:
data_set = 0
for i in range(0, number_lines, rowsize):
    data_set += 1
    ...
    out_xyz = f"Profile{data_set}.xyz"
    ...
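As an aside, because every dataset is exactly 201 rows, pandas can also stream the file in fixed-size chunks, which avoids re-reading the whole file once per dataset (a sketch, reusing the filename and comma delimiter from your snippet):
import pandas as pd

reader = pd.read_csv('bathy_SPO_1984_50x50_profile.xyz', delimiter=',',
                     header=None, chunksize=201)
for data_set, chunk in enumerate(reader, start=1):
    # each chunk is a DataFrame holding one 201-row dataset
    chunk.to_csv(f'Profile{data_set}.xyz', index=False, header=False)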

Python 3.7 Start reading from a specific point within a csv

Hey, I could really use some help here. I've spent an hour looking for a Python solution but was unable to find one.
I am using Python 3.7
My input is a file provided by a customer - I cannot change it. It is structured in the following way:
It starts with random text that is not in CSV format, and from line 3 on the rest of the file is in CSV format.
text line
text line
text line or nothing
(empty line)
[Start of csv file] "column name 1","column name 2" ... up to 6 columns
"value1","value2" ... up to 6 values - continuing for many lines
I wanted to strip the first 3 lines to create a pure CSV file, but was unable to find code that does this for a specific line range only. It also seems like the wrong solution, as I think starting to read from a certain point should be possible.
Then I thought split() was the solution, but it did not work for this format. The values are sometimes numbers, dates or strings. You cannot use the seek() method, as the files start differently.
Right now my DictReader takes the first line as the header, and consequently the rest is rendered in chaos.
import csv
import pandas as pd
from prettytable import PrettyTable

with open(r'C:\Users\Hans\Downloads\file.csv') as csvfile:
    csv_reader = csv.DictReader(r'C:\Users\Hans\Downloads\file.csv', delimiter=',')
    for lines in csvfile:
        print(lines)
If some answer for python has been found please link it, I was not able to find it.
Thank you so much for your help. I really appreciate it.
I will insist on the pandas option, given that the documentation clearly states that the skiprows parameter allows skipping n lines. I tried it with the example provided by @Chris Doyle (saving it to a file named line_file.csv) and it works as expected.
import pandas as pd
f = pd.read_csv('line_file.csv', skiprows=3)
Output
    name  num symbol
0  chris    4      $
1   adam    7      &
2  david    5      %
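Note that skiprows also accepts a list of row indices (or a callable), which is handy when the lines to drop are not contiguous; for this file the following is equivalent:
f = pd.read_csv('line_file.csv', skiprows=[0, 1, 2])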
If you know the number of lines you want to skip, then just open the file, read that many lines, and pass the filehandle to DictReader; it will read the remaining lines.
import csv

skip_n_lines = 3
with open('test.dat') as my_file:
    for _ in range(skip_n_lines):
        print("skipping line:", my_file.readline(), end='')
    print("###CSV DATA###")
    csv_reader = csv.DictReader(my_file)
    for row in csv_reader:
        print(row)
FILE
this is junk
this is more junk
last junk
name,num,symbol
chris,4,$
adam,7,&
david,5,%
OUTPUT
skipping line: this is junk
skipping line: this is more junk
skipping line: last junk
###CSV DATA###
OrderedDict([('name', 'chris'), ('num', '4'), ('symbol', '$')])
OrderedDict([('name', 'adam'), ('num', '7'), ('symbol', '&')])
OrderedDict([('name', 'david'), ('num', '5'), ('symbol', '%')])
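The skip loop can also be written with itertools.islice, which advances the file handle without an explicit readline() call (a minimal variant of the same approach; on Python 3.8+ DictReader yields plain dicts rather than OrderedDicts):
import csv
from itertools import islice

with open('test.dat') as my_file:
    # consume and echo the first three junk lines
    for skipped in islice(my_file, 3):
        print("skipping line:", skipped, end='')
    for row in csv.DictReader(my_file):
        print(row)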

Skip lines with strange characters when I read a file

I am trying to read some data files ('.txt'), and some of them contain strange random characters and even extra columns in random rows, as in the following example, where the second row is an example of a correct row:
CTD 10/07/30 05:17:14.41 CTD 24.7813, 0.15752, 1.168, 0.7954, 1497.¸ 23.4848, 0.63042, 1.047, 3.5468, 1496.542
CTD 10/07/30 05:17:14.47 CTD 23.4846, 0.62156, 1.063, 3.4935, 1496.482
I read the description of np.loadtxt and I have not found a solution for my problem. Is there a systematic way to skip rows like these?
The code that I use to read the files is:
import numpy as np
from io import StringIO

# Function to read a datafile
def Read(filename):
    # Change delimiters to spaces
    s = open(filename).read().replace(':', ' ')
    s = s.replace(',', ' ')
    s = s.replace('/', ' ')
    # Take the columns that we need
    data = np.loadtxt(StringIO(s), usecols=(4, 5, 6, 8, 9, 10, 11, 12))
    return data
This works without using csv, unlike the other answer, and just reads line by line, checking whether each character is ASCII:
data = []

def isascii(s):
    return len(s) == len(s.encode())

with open("test.txt", "r") as fil:
    for line in fil:
        # map() applies the check to every character of the line
        res = map(isascii, line)
        if all(res):
            data.append(line)
print(data)
You could use the csv module to read the file one line at a time and apply your desired filter.
import csv

def isascii(s):
    return len(s) == len(s.encode())

expected_length = ...  # set this to the number of fields in a valid row
with open('file.csv') as csvfile:
    csvreader = csv.reader(csvfile)
    for row in csvreader:
        if len(row) == expected_length and all(isascii(x) for x in row):
            # write row onto numpy array
            pass
I got the ASCII check from this thread:
How to check if a string in Python is in ASCII?
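As a side note, on Python 3.7+ the helper is unnecessary, since strings have a built-in isascii() method that performs the same check:
# equivalent to len(s) == len(s.encode())
if line.isascii():
    data.append(line)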
