Getting KeyError for pandas df column name that exists - python-3.x

I have
data_combined = pd.read_csv("/path/to/creole_data/data_combined.csv", sep=";", encoding='cp1252')
So, when I try to access these rows:
data_combined = data_combined[(data_combined["wals_code"]=="abk") & (data_combined["wals_code"]=="aco")]
I get a KeyError 'wals_code'. I then checked my list of col names with
print(data_combined.columns.tolist())
and saw the col name 'wals_code' in the list. Here are the first few items from the printout.
[',"wals_code","Order of subject, object and verb","Order of genitive and noun","Order of adjective and noun","Order of adposition and NP","Order of demonstrative and noun","Order of numeral and noun","Order of RC and noun","Order of degree word and adjective"]
Anyone have a clue what is wrong with my file?

The problem is the delimiter you're using when reading the CSV file. With sep=';', you instruct read_csv to use semicolons (;) as the column separator (for cells and column headers), but it appears from your printout that your CSV file actually uses commas (,).
If you look carefully, you'll notice that your printout is actually a list containing one long string, not a list of individual strings representing the column names.
So, use sep=',' instead of sep=';' (or just omit it entirely as , is the default value for sep):
data_combined = pd.read_csv("/path/to/creole_data/data_combined.csv", encoding='cp1252')
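A quick way to confirm the fix is to re-read the file with the comma separator and check the header again (a minimal sketch using the path from the question):

import pandas as pd

# Read with the default comma separator (sep=',' shown explicitly for clarity).
data_combined = pd.read_csv("/path/to/creole_data/data_combined.csv", sep=",", encoding='cp1252')

# The header should now be a list of individual column names,
# so this lookup no longer raises a KeyError.
print(data_combined.columns.tolist())
print(data_combined[data_combined["wals_code"] == "abk"].head())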

Related

extract position of char from each row & provide an aggregate of position across a list

I need some Python help with this problem. Any assistance would be appreciated! Thanks.
I need an extracted matrix of values enclosed between square brackets. A toy example is below:
File Input will be in a txt file as below:
AB_1 Q[A]IHY[P]GVA
AB_2 Q[G][C]HY[R]GVA
AB_3 Q[G][C]HY[R]GV[D]
Expected out.txt: the script extracts the index of each character enclosed in square brackets "[]" for every row of the input and aggregates the index positions across the entire list. The aggregated index is then used to extract all of those positions from the input file and produce a matrix as below.
Index 2,3,6,9
AB_1 [A],I,[P],A
AB_2 [G],[C],[R],A
AB_3 [G],[C],[R],[D]
Any help would be greatly appreciated! Thanks.
If you want to reduce your table to only those columns in which an entry with square-brackets appears, you can go with this:
import re

def transpose(matrix):
    return [[matrix[j][i] for j in range(len(matrix))] for i in range(len(matrix[0]))]

with open("test_table.txt", "r") as f:
    content = f.read()

rows = [re.findall(r'(\[.\]|.)', line.split()[1]) for line in content.split("\n")]
columns = transpose(rows)
matching_columns = [[str(i + 1)] + column for i, column in enumerate(columns) if "[" in "".join(column)]
matching_rows = transpose(matching_columns)
headline = ["Index {}".format(",".join(matching_rows[0]))]
target_table = headline + ["AB_{0} {1}".format((i + 1), ",".join(line)) for i, line in enumerate(matching_rows[1:])]

with open("out.txt", "w") as f:
    f.write("\n".join(target_table))
First of all you want the content of your .txt file to be represented in arrays. Unfortunately your input data has no separators yet (as in .csv files), so you need to take care of that. To sort out a string like "Q[A]IHY[P]GVA" I would recommend working with regular expressions.
import re
cells = re.findall(r'(\[.\]|.)', "Q[A]IHY[P]GVA")
The pattern within the r'' string matches a single character within square brackets or just a single character. The re.findall() method returns a list of all matching substrings, in this case: ['Q', '[A]', 'I', 'H', 'Y', '[P]', 'G', 'V', 'A']
rows = [re.findall(r'(\[.\]|.)', line.split()[1]) for line in content.split("\n")] applies this method to every line in your file. The line.split()[1] leaves out the row label "AB_X ", as it is not useful here.
Having your data sorted in columns is more fitting, because you want to preserve all columns that match a certain condition (contain an entry in brackets). For this you can just transpose rows. This is done by the transpose() function. If you have imported numpy, numpy.transpose(rows) would be the better option, I guess.
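For reference, a minimal sketch of that numpy alternative (assuming the rows are rectangular, as in the toy example):

import numpy as np

# numpy.transpose works on a rectangular list of lists of strings;
# .tolist() converts back so the rest of the code stays unchanged.
rows = [['Q', '[A]', 'I'], ['Q', '[G]', 'I']]  # toy data
columns = np.transpose(rows).tolist()
print(columns)  # [['Q', 'Q'], ['[A]', '[G]'], ['I', 'I']]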
Next you want to get all columns that satisfy your condition "[" in "".join(column). All done in one line by: matching_columns = [[str(i + 1)] + column for i, column in enumerate(columns) if "[" in "".join(column)] Here [str(i + 1)] does add the column index that you want to use later.
The rest now is easy: Transpose the columns back to rows, relabel the rows, format the row data into strings that fit your desired output format and then write those strings to the out.txt file.

Remove double quotes while printing string in dataframe to text file

I have a dataframe which contains one column with multiple strings. Here is what the data looks like:
Value
EU-1050-22345,201908 XYZ DETAILS, CD_123_123;CD_123_124,2;1
There are almost 100,000 such rows in the dataframe. I want to write this data into a text file.
For this, I tried the following:
df.to_csv(filename, header=None,index=None,mode='a')
But I am getting the entire string in quotes when I do this. The output I obtain is:
"EU-1050-22345,201908 XYZ DETAILS, CD_123_123;CD_123_124,2;1"
But what I want is:
EU-1050-22345,201908 XYZ DETAILS, CD_123_123;CD_123_124,2;1 -> No Quotes
I also tried this:
df.to_csv(filename,header=None,index=None,mode='a',
quoting=csv.QUOTE_NONE)
However, I get an error that an escapechar is required. If I add escapechar='/' into the code, I get '/' in multiple places (but no quotes). I don't want the '/' either.
Is there any way I can remove the quotes while writing to a text file WITHOUT adding any other escape characters?
Based on OP's comment, I believe the semicolon is messing things up. I no longer get an unwanted \ when using tabs to delimit the CSV.
import pandas as pd
import csv
df = pd.DataFrame(columns=['col'])
df.loc[0] = "EU-1050-22345,201908 XYZ DETAILS, CD_123_123;CD_123_124,2;1"
df.to_csv("out.csv", sep="\t", quoting=csv.QUOTE_NONE, quotechar="", escapechar="")
Original Answer:
According to this answer, you need to specify escapechar="\\" to use csv.QUOTE_NONE.
Have you tried:
df.to_csv("out.csv", sep=",", quoting=csv.QUOTE_NONE, quotechar="", escapechar="\\")
I was able to write a df to a csv using a single space as the separator and get the "quotes" around strings removed by replacing existing in-string spaces in the dataframe with non-breaking spaces before writing it out as a csv.
df = df.applymap(lambda x: str(x).replace(' ', u"\u00A0"))
df.to_csv(outpath+filename, header=True, index=None, sep=' ', mode='a')
I couldn't use a tab-delimited file for the output I was writing, though that solution also works using additional keywords to df.to_csv(): quoting=csv.QUOTE_NONE, quotechar="", escapechar="".

pandas.read_clipboard only reads whole lines not columns

I transferred all my python3 code from macOS to Ubuntu 18.04, and in one program I need to use pandas.read_clipboard(). At this point in time there is a list in the clipboard with multiple lines and columns divided by tabs and each element in quotation marks.
After just trying
import pandas as pd
df = pd.read_clipboard()
I'm getting this error: pandas.errors.ParserError: Expected 8 fields in line 3, saw 11. Error could possibly be due to quotes being ignored when a multi-char delimiter is used. And line 3 looks like "word1" "word2 and another" "word3" .... Ignoring the quotation marks you count 11 elements; counting quoted phrases as single fields you count 8.
In the next step I tried
import pandas as pd
df = pd.read_clipboard(sep='\t')
and I'm getting no errors but it results only in a Series with each line of the clipboard source in one element.
Yes, maybe one solution is to write code that separates the elements of each line after this step, but because it works very well under macOS (with just pd.read_clipboard()) I hope there's a better solution.
Thank you for helping.
I wrote a workaround for my question. It's not the exact solution, but because I just need the elements of one column in an array I solved it like that:
import pyperclip

# read clipboard
cb = pyperclip.paste()
# lines in array
cb_arr = cb.splitlines()

column = []
for cb_line in cb_arr:
    # words in array
    cb_words = cb_line.split("\"")
    # pick element of column 1
    word = cb_words[1]
    column.append(word)

# delete column name
column.pop(0)
print(column)
Maybe it helps someone else, too.
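For anyone who needs the whole table rather than a single column, a possible variation (a sketch, assuming the clipboard really is tab-separated with each field in double quotes) is to hand the clipboard text to read_csv directly so the quoting is honoured:

import io
import pandas as pd
import pyperclip

# Parse the clipboard text with read_csv instead of read_clipboard's default
# whitespace splitting, so quoted fields containing spaces stay together.
raw = pyperclip.paste()
df = pd.read_csv(io.StringIO(raw), sep='\t', quotechar='"')
print(df.head())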

Save Tweets as .csv, Contains String Literals and Entities

I have tweets saved in JSON text files. I have a friend who wants tweets containing keywords, and the tweets need to be saved in a .csv. Finding the tweets is easy, but I run into two problems and am struggling with finding a good solution.
Sample data are here. I have included the .csv file that is not working as well as a file where each row is a tweet in JSON format.
To get into a dataframe, I use pd.io.json.json_normalize. It works smoothly and handles nested dictionaries well, but pd.to_csv does not work because it does not handle, as far as I can tell, string literals well. Some of the tweets contain '\n' in the text field, and pandas writes new lines when that happens.
No problem, I process pd['text'] to remove '\n'. The resulting file still has too many rows, 1863 compared to the 1388 it should have. I then modified my code to replace all the escape sequences:
tweets['text'] = [item.replace('\n', '') for item in tweets['text']]
tweets['text'] = [item.replace('\r', '') for item in tweets['text']]
tweets['text'] = [item.replace('\\', '') for item in tweets['text']]
tweets['text'] = [item.replace('\'', '') for item in tweets['text']]
tweets['text'] = [item.replace('\"', '') for item in tweets['text']]
tweets['text'] = [item.replace('\a', '') for item in tweets['text']]
tweets['text'] = [item.replace('\b', '') for item in tweets['text']]
tweets['text'] = [item.replace('\f', '') for item in tweets['text']]
tweets['text'] = [item.replace('\t', '') for item in tweets['text']]
tweets['text'] = [item.replace('\v', '') for item in tweets['text']]
Same result, pd.to_csv saves a file with more rows than actual tweets. I could replace string literals in all columns, but that is clunky.
Fine, don't use pandas. Using with open(outpath, 'w') as f: and so on creates a .csv file with the correct number of rows. Reading the file back, however, either with pd.read_csv or line by line, fails.
It fails because of how Twitter handles entities. If a tweet's text contains a url, mention, hashtag, media, or link, then Twitter returns a dictionary that contains commas. When pandas flattens the tweet, the commas get preserved within a column, which is good. But when the data are read back in, pandas splits what should be one column into multiple columns. For example, a column might look like [{'screen_name': 'ProfOsinbajo', 'name': 'Prof Yemi Osinbajo', 'id': 2914442873, 'id_str': '2914442873', 'indices': [0, 13]}], so splitting on commas creates too many columns:
[{'screen_name': 'ProfOsinbajo',
'name': 'Prof Yemi Osinbajo',
'id': 2914442873",
'id_str': '2914442873'",
'indices': [0,
13]}]
That is also the outcome when I use with open(outpath) as f:. With that approach, I have to split lines, so I split on commas. Same problem - I do not want to split on commas if they occur in a list.
I want those data to be treated as one column when saved to file or read from file. What am I missing? In terms of the data at the repository above, I want to convert forstackoverflow2.txt to a .csv with as many rows as tweets. Call this file A.csv, and let's say it has 100 columns. When opened, A.csv should also have 100 columns.
I'm sure there are details I've left out, so please let me know.
Using the csv module works. It writes the file out as a .csv while counting the lines, then reads it back in and counts the lines again.
The result matched, and opening the .csv in Excel also gives 191 columns and 1338 lines of data.
import json
import csv

with open('forstackoverflow2.txt') as f,\
     open('out.csv','w',encoding='utf-8-sig',newline='') as out:
    data = json.loads(next(f))
    print('columns',len(data))
    writer = csv.DictWriter(out,fieldnames=sorted(data))
    writer.writeheader()  # write header
    writer.writerow(data)  # write the first line of data
    for i,line in enumerate(f,2):  # start line count at two
        data = json.loads(line)
        writer.writerow(data)
    print('lines',i)

with open('out.csv',encoding='utf-8-sig',newline='') as f:
    r = csv.DictReader(f)
    lines = list(r)
    print('readback columns',len(lines[0]))
    print('readback lines',len(lines))
Output:
columns 191
lines 1338
readback lines 1338
readback columns 191
@Mark Tolonen's answer is helpful, but I ended up going a separate route. When saving the tweets to file, I removed all \r, \n, \t, and \0 characters from anywhere in the JSON. Then, I saved the file as tab-separated so that commas in fields like location or text do not confuse a read function.
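A minimal sketch of that route (pd.json_normalize is the current name for the flattening used in the question; the exact column handling here is an assumption):

import json
import pandas as pd

# Flatten each tweet, strip the control characters from every string cell,
# then write a tab-separated file so commas inside entity fields are harmless.
with open('forstackoverflow2.txt') as f:
    tweets = pd.json_normalize([json.loads(line) for line in f])

for col in tweets.columns:
    if tweets[col].dtype == object:
        tweets[col] = tweets[col].astype(str).str.replace(r'[\r\n\t\x00]', '', regex=True)

tweets.to_csv('tweets.tsv', sep='\t', index=False)

# Reading the file back should give one row per tweet.
print(pd.read_csv('tweets.tsv', sep='\t').shape)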

Skipping over array elements of certain types

I have a csv file that gets read into my code where arrays are generated out of each row of the file. I want to ignore all the array elements with letters in them and only worry about changing the elements containing numbers into floats. How can I change code like this:
myValues = []
data = open(text_file,"r")
for line in data.readlines()[1:]:
    myValues.append([float(f) for f in line.strip('\n').strip('\r').split(',')])
so that the last line knows to only try converting numbers into floats, and to skip the letters entirely?
Put another way, given this list,
list = ['2','z','y','3','4']
what command should be given so the code knows not to try converting letters into floats?
You could use try: except:
myVal = []
for i in list:
    try:
        myVal.append(float(i))  # keep values that convert to float
    except:
        pass  # skip anything that does not (e.g. letters)
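Applied to the original file-reading loop, it might look like this (a sketch; catching ValueError specifically is a little safer than a bare except):

myValues = []
with open(text_file, "r") as data:
    for line in data.readlines()[1:]:
        row = []
        for f in line.strip().split(','):
            try:
                row.append(float(f))  # keep numeric entries as floats
            except ValueError:
                pass  # skip entries containing letters
        myValues.append(row)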
