I parse a CSV file into a DataFrame. 10,000 records go in, no problems.
There are two columns: one 'ID', one 'Reviews'.
I try to convert the DataFrame into a dictionary where keys = 'ID' and values = 'Reviews'.
For some reason the new dictionary only contains 680 records.
import pandas as pd

# read csv data file
data = pd.read_csv("Movie_reviews.csv",
                   delimiter='\t',
                   header=None, names=['ID', 'Reviews'])
reviews = data.set_index('ID').to_dict().get('Reviews')
len(reviews)
The output is 680. If I don't append .get('Reviews'), everything comes back as one big record.
The DataFrame 'data' looks like this:
           ID                                            Reviews
1  076780192X  it always amazes me how people can rate the DV...
2  0767821599  This movie is okay, but, its not worth what th...
3  0782008380  If you love the Highlander 1 movie and the ser...
4  0767726227  This is a great classic collection, if you lik...
5  0780621832  This is the second of John Ford and John Wayne...
6  0310263662  I am an evangelical Christian who believes in ...
7  0767809270  Federal law, in one of its numerous unfunded m...
In case it helps anyone else: the IDs for the movie reviews were not all unique. The .nunique() function revealed that, as suggested by @YOLO.
Assigning only the values (Reviews) to the dictionary automatically added unique keys, as suggested by @JackHoman, which resolved my issue.
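For anyone checking this on their own data, a minimal sketch of the duplicate check (using the same DataFrame as above):

data['ID'].nunique()  # number of distinct IDs
len(data)             # number of rows; if this is larger, some IDs repeat and will collapse into one key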
I think you can do:
Method 1:
reviews = data.set_index('ID')['Reviews'].to_dict()
Method 2: Here we convert reviews to a list for each ID so that we don't lose any information.
reviews = data.groupby('ID')['Reviews'].apply(list).to_dict()
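A quick way to see the difference between the two methods, using a tiny illustrative frame with a duplicated ID:

import pandas as pd

df = pd.DataFrame({'ID': ['a', 'a', 'b'], 'Reviews': ['r1', 'r2', 'r3']})
print(df.set_index('ID')['Reviews'].to_dict())
# {'a': 'r2', 'b': 'r3'}  <- the last duplicate wins
print(df.groupby('ID')['Reviews'].apply(list).to_dict())
# {'a': ['r1', 'r2'], 'b': ['r3']}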
I have built an isolation forest to detect anomalies in a csv file that I have, and I want to change the format of the output data. Right now the anomaly data is output as a pandas DataFrame, but I would like to write it to a JSON file in the following format:
{seconds: #seconds for that row, size2: size2, pages: #pages for that row}
I have attached some of the code and a sample of the data, thank you so much!
model.fit(df[['label']])
df['anomaly'] = model.fit_predict(df[['size2', 'size3', 'size4']])
# df['anomaly'] = model.predict(df[['pages']])
print(model.predict(X_test))
anomaly = df.loc[df['anomaly'] == -1]
anomaly_index = list(anomaly.index)
print(anomaly)
The output data looks something like this:
   Unnamed:  seconds:  size2:  ...  size4:  pages:  anomaly:
          1        40      32  ...     654       1        -1
I have figured out a way to do this; I made multiple dictionaries, one mapping the index of the row to that timestamp, and one mapping the index of the row to the label. I was then able to keep track of which indexes were in the output data, and access all the information from those dictionaries.
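An alternative sketch that skips the intermediate dictionaries, assuming the column names from the sample output (seconds, size2, pages) and the -1 anomaly flag:

import json

anomaly = df.loc[df['anomaly'] == -1]
# one dict per anomalous row, keeping only the fields of interest
records = anomaly[['seconds', 'size2', 'pages']].to_dict(orient='records')
with open('anomalies.json', 'w') as f:
    json.dump(records, f)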
I have a csv file with data like
Name Cost SKU QTY
Julia 1 13 10
John 5 23 1
Julia 3 40 5
I would like to return a dictionary as:
{'Julia':'10', 'John':'1', 'Julia':'5'....}
My code is returning no duplicates as of now.
Run this:
dict(zip(df.Name,df.QTY)
Check this answer on Stack Overflow (Creating a dictionary from a csv file?). First you have to read the file as a dictionary and change the values accordingly. Remember that a dictionary contains only distinct keys: if a key already exists, its value is replaced by the new one on the fly.
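Note that a dict literally cannot hold 'Julia' twice, so the output shown in the question isn't achievable as written. A sketch that keeps every quantity by grouping duplicates into lists (same df as above):

df.groupby('Name')['QTY'].apply(list).to_dict()
# {'John': [1], 'Julia': [10, 5]}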
d1 = dataset['End station'].head(20)
for x in d1:
    x = re.compile("[0-9]{5}")
print(d1)
Using dataset['End_Station'] = dataset['End station'].map(lambda x: re.compile("([0-9]{5})").search(x).group())
raises TypeError: expected string or bytes-like object.
I am new to data analysis and can't think of any other methods.
Pandas has its own methods concerning Regex, so the "more pandasonic" way
to write code is just to use them, instead of native re methods.
Consider such example of source data:
End station
0 4055 Johnson Street, Chicago
1 203 Mayflower Avenue, Columbus
To find the street No in the above addresses, run:
df['End station'].str.extract(r'(?P<StreetNo>\d{,5})')
and you will get:
StreetNo
0 4055
1 203
Note also that the street number may be shorter than 5 digits, whereas you attempt
to match a sequence of exactly 5 digits.
Another weird point in your code: why do you compile a regex in a loop
and then make no use of it?
Edit
After a more thorough look at your code, I have a couple of additional remarks.
When you write:
for x in df:
    ...
the loop actually iterates over column names (not rows).
Another weird point is that the variable x, initially used to hold
a column name, is reused to store a compiled regex.
That is a bad habit. Each variable should hold one clearly
defined object.
And as far as iteration over rows is concerned, you can use e.g.
for idx, row in df.iterrows():
    ...
But note that iterrows returns pairs composed of:
index of the current row,
the row itself.
Then (in the loop) you will probably refer to individual columns of this row.
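For example, a minimal sketch with the 'End station' column from above:

for idx, row in df.iterrows():
    # idx is the row's index label, row is a Series indexed by column names
    print(idx, row['End station'])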
I have 2 csv files: one is dictionary.csv, which contains a list of words, and the other is story.csv. story.csv has many columns, and one of them, news_story, contains a lot of words. I want to check whether the words from dictionary.csv exist in the news_story column, and then print all of the rows in which news_story contains words from dictionary.csv to a new csv file called New.csv.
This is the code I have tried so far:
import csv
import pandas as pd

news = pd.read_csv("story.csv")
dictionary = pd.read_csv("dictionary.csv")

pattern = '|'.join(dictionary)

exist = news['news_story'].str.contains(pattern)
for CHECK in exist:
    if not CHECK:
        news['NEWcolumn'] = 'NO'
    else:
        news['NEWcolumn'] = 'YES'
news.to_csv('New.csv')
I kept getting 'NO' even though there should be some 'YES' values.
story.csv
news_url news_title news_date news_story
goog.com functional 2019 This story is about a functional requirement
live.com pbandJ 2001 I made a sandwich today
key.com uAndI 1992 A code name of a spy
dictionary.csv
red
tie
lace
books
functional
New.csv
news_url news_title news_date news_story
goog.com functional 2019 This story is about a functional requirement
First, convert the column to a Series, using header=None to avoid dropping the first value, together with squeeze=True in read_csv:
dictionary=pd.read_csv("dictionary.csv", header=None, squeeze=True)
print (dictionary)
0 red
1 tie
2 lace
3 books
4 functional
Name: 0, dtype: object
pattern = '|'.join(dictionary)
# to avoid matching substrings, use word boundaries
# pattern = '|'.join(r"\b{}\b".format(x) for x in dictionary)
Last, filter by boolean indexing:
exist = news['news_story'].str.contains(pattern)
news[exist].to_csv('New.csv')
Detail:
print (news[exist])
news_url news_title news_date \
0 goog.com functional 2019
news_story
0 This story is about a functional requirement
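If you also want the YES/NO column from your original attempt, a short sketch using numpy.where instead of the loop (the loop overwrote the whole column on every pass):

import numpy as np

news['NEWcolumn'] = np.where(exist, 'YES', 'NO')
news.to_csv('New.csv', index=False)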
I have a question whose variations have already been asked, but among all the previous posts I'm not able to find an answer to my particular one. So I hope someone can help me ...
I have a csv file as such (in this example, there are a total of 18 rows, and 4 columns, with the first 2 rows containing headers).
"Employees.csv" 17
ID Name Value Location
25-2002 James 1.2919 Finance
25-2017 Matthew 2.359 Legal
30-3444 Rob 3.1937 Operations
55-8988 Fred 3.1815 Research
26-1000 Lisa 4.3332 Research
56-0909 John 3.3533 Legal
45-8122 Anna 3.8887 Finance
10-1000 Rachel 4.1448 Maintenance
30-9000 Frank 3.7821 Maintenance
25-3000 Angela 5.5854 Service
45-4321 Christopher 9.1598 Legal
44-9821 Maddie 8.5823 Service
20-4000 Ruth 7.47 Operations
50-3233 Vera 5.5092 Operations
65-2045 Sydney 3.4542 Executive
45-8720 Vladimir 0.2159 Finance
I'd like to round the values in the 3rd column to 2 decimals, i.e., round(value, 2). So basically, I want to open the file, read column #3 (minus the first 2 rows), round each value, write them back, and save the file. After reading through other similar posts, I've learned that it's best to do this work in a temp file instead of trying to change the same file in place. So I have the following code:
import csv, os

val = []
with open('path/Employees.csv', 'r') as rf, open('path/tmpf.csv', 'w') as tmpf:
    reader = csv.reader(rf)
    writer = csv.writer(tmpf)
    for _ in range(2):  # skip first 2 rows
        next(reader)
    for line in reader:
        val.append(float(line[2]))  # read 3rd column into list 'val'
        # [... this is where I got stuck!
        # ... how do I round each value of val, and
        # ... write everything back to the tmpf file ?]
os.remove('path/Employees.csv')
os.rename('path/tmpf.csv', 'path/Employees.csv')
Thanks!
You could use
rounded_val = [round(v, 2) for v in val]
to generate the list of rounded values.
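A sketch of the full loop, building on your code (same assumed paths; the two header rows are copied through unchanged, then column 3 is rounded on each data row as it is written):

import csv, os

with open('path/Employees.csv', 'r') as rf, open('path/tmpf.csv', 'w', newline='') as tmpf:
    reader = csv.reader(rf)
    writer = csv.writer(tmpf)
    for _ in range(2):                      # copy the 2 header rows as-is
        writer.writerow(next(reader))
    for line in reader:
        line[2] = round(float(line[2]), 2)  # round the 3rd column
        writer.writerow(line)

os.remove('path/Employees.csv')
os.rename('path/tmpf.csv', 'path/Employees.csv')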