Creating a dictionary of dictionaries from csv file - string

Hi so I am trying to write a function, classify(csv_file) that creates a default dictionary of dictionaries from a csv file. The first "column" (first item in each row) is the key for each entry in the dictionary and then second "column" (second item in each row) will contain the values.
However, I want to alter the values by calling on two functions (in this order):
trigram_c(string): that creates a default dictionary of trigram counts within the string (which are the values)
normal(tri_counts): that takes the output of trigram_c and normalises the counts (i.e converts the counts for each trigram into a number).
Thus, my final output will be a dictionary of dictionaries:
{value: {trigram1 : normalised_count, trigram2: normalised_count}, value2: {trigram1: normalised_count...}...} and so on
My current code looks like this:
def classify(csv_file):
    l_rows = list(csv.reader(open(csv_file)))
    classified = dict((l_rows[0], l_rows[1]) for rows in l_rows)
For example, if the csv file was:
Snippet1, "It was a dark stormy day"
Snippet2, "Hello world!"
Snippet3, "How are you?"
The final output would resemble:
{Snippet1: {'It ': 0.5352, 't w': 0.43232}, Snippet2: {'Hel' : 0.438724,...}...} and so on.
(Of course there would be more than just two trigram counts, and the numbers are just random for the purpose of the example).
Any help would be much appreciated!

First of all, please check your classify function, because I can't run it as posted. Here is a corrected version:
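import csv

def classify(csv_file):
    # Close the file automatically once the rows have been read
    with open(csv_file, newline='') as f:
        l_rows = list(csv.reader(f))
    # Key: first column, value: second column
    classified = dict((row[0], row[1]) for row in l_rows)
    return classified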
It returns a dictionary whose keys come from the first column and whose values are the strings from the second column.
So you should iterate over every dictionary entry and pass its value to the trigram_c function. I didn't understand how you calculate the trigram counts, but if you just count how many times each trigram appears in the string, you could use the function below. If you want to count differently, you only need to update the code in the for loop.
def trigram_c(string):
    trigram_dict = {}
    start = 0
    end = 3
    for i in range(len(string) - 2):
        # you could implement your logic in this loop
        trigram = string[start:end]
        if trigram in trigram_dict:
            trigram_dict[trigram] += 1
        else:
            trigram_dict[trigram] = 1
        start += 1
        end += 1
    return trigram_dict
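To tie the pieces together, here is a minimal sketch of the whole pipeline. The question never shows normal, so the version here simply assumes that "normalising" means dividing each count by the total number of trigrams; swap in your own definition if it differs. classify_rows is a hypothetical name: it takes an iterable of rows instead of a file name so the example runs without a file on disk, and skipinitialspace=True is used because the sample rows have a space after the comma.

```python
import csv
import io
from collections import defaultdict


def trigram_c(string):
    """Count every overlapping 3-character window in the string."""
    counts = defaultdict(int)
    for i in range(len(string) - 2):
        counts[string[i:i + 3]] += 1
    return counts


def normal(tri_counts):
    """ASSUMPTION: normalise by dividing each count by the total count."""
    total = sum(tri_counts.values())
    return {tri: n / total for tri, n in tri_counts.items()}


def classify_rows(rows):
    """Map column 1 of each row to the normalised trigram counts of column 2."""
    return {row[0]: normal(trigram_c(row[1])) for row in rows}


# Two of the sample rows from the question, read from memory instead of a file
sample = 'Snippet2, "Hello world!"\nSnippet3, "How are you?"\n'
rows = csv.reader(io.StringIO(sample), skipinitialspace=True)
classified = classify_rows(rows)
print(classified["Snippet2"])
```

To read from a real file, pass csv.reader(f, skipinitialspace=True) for an open file object f instead of the in-memory sample.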

Related

Is there an python solution for mapping a (pandas data frame) with (unique values of Split a string column)

I have a data frame (df).
The Data frame contains a string column called: supported_cpu.
The (supported_cpu) data is a string type separated by a comma.
I want to use this data for the ML model.
I had to get unique values for the column (supported_cpu). The output is a (list) of unique values.
def pars_string(df, col):
    # Separate the column from the string using split
    data = df[col].value_counts().reset_index()
    data['index'] = data['index'].str.split(",")
    # Create a list including all of the items, which are separated by commas
    df_01 = []
    for i in range(data.shape[0]):
        for j in data['index'][i]:
            df_01.append(j)
    # get unique values from df_01
    list_01 = list(set(df_01))
    # there are some leading or trailing spaces in list_01 which need to be deleted to get unique values
    list_02 = [x.strip(' ') for x in list_01]
    # get unique values from list_02
    list_03 = list(set(list_02))
    return list_03
supported_cpu_list = pars_string(df=df,col='supported_cpu')
The output is a list of the unique values.
I want to map this output to the data frame to encode it for the ML model.
How could I store the output in the data frame? Note: some rows have multiple values (more than one CPU).
Input: a string column whose values are separated by commas
Output: I do not know what it should be.
I really recommend that anyone who is starting out with pandas read about vectorization and think in terms of columns (aka Series). This is the way it was built and the way it is supposed to be used.
From what I understand (I may be wrong), you want to get the unique values from the supported_cpu column. So you could use the Series string methods to split that column, then flatten the resulting lists using itertools.chain:
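from itertools import chain
# Split each comma-separated string into a list of items
df['supported_cpu'] = df['supported_cpu'].str.split(pat=',')
# Flatten the lists and keep each distinct item once
unique_vals = set(chain(*df['supported_cpu'].tolist()))
# Strip surrounding spaces and drop empty strings
unique_vals = {item.strip() for item in unique_vals if item.strip()}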
Multi-values in some rows should be parsed to single values for later ML model training. The list can be converted to dataframe simply by pd.DataFrame(supported_cpu_list).
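As for storing the result back in the data frame, one common option is a multi-hot encoding: one 0/1 column per unique CPU value, so rows with several CPUs simply have several ones. The following is only a sketch under assumptions: the CPU names in the sample frame are made up, and whether this encoding suits your model is for you to judge.

```python
import pandas as pd

# Hypothetical sample data; the real column values are not shown in the question
df = pd.DataFrame({"supported_cpu": ["x86, arm64", "arm64", "x86,ppc64le , arm64"]})

# One row per (original row, CPU) pair, with surrounding spaces removed
tokens = df["supported_cpu"].str.split(",").explode().str.strip()
tokens = tokens[tokens != ""]

# The unique values, equivalent to what pars_string returns
unique_vals = sorted(tokens.unique())

# One indicator column per CPU, collapsed back to one row per original row
encoded = pd.get_dummies(tokens, dtype=int).groupby(level=0).max()
result = df.join(encoded)
print(result)
```

explode keeps the original row index, which is what lets groupby(level=0) fold the indicators back onto the rows they came from.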

Get dict values based on a filtered list with a varying number of returned elements

I am filtering elements from a list that include '"' by the following code:
def sizes():
    new_list = [x for x in mid_item_size_one if '"' in x]
    return new_list
This will return any element with '"' as desired. Example strings below.
Random Text 0.5" Random Text
0.25" Random Text
1.5" x 0.5" Random Text
I .split then apply above function to return:
['0.5"']
['0.25"']
['1.5"', '0.5"']
I now need to lookup each of the elements in a dictionary and return the value from the key:value pair as new individual variables so I will be able to add them to a new string that contains a number of other variables. See example of desired Result below:
val_1 = '0.5"'
val_1 = '0.5"'
val_1 = '1.5"' and val_2 = '0.5"'
Random Text val_1 Random Text
Random Text val_1 Random Text
Random Text val_1 Random Text val_2
I already have my function to lookup/retrieve the value from dictionary however since I started retrieving the values via filter, I haven't been able to figure out how to retrieve the dict value.
def item_size_one_final(size_dict):
    for x in sizes():
        for key in size_dict:
            if key in sizes():
                return size_dict[key]
            return "Hmmmm"
    return "Not Working"
The above for loops result in ['Hmmmm'] on all of it. Does anyone have any suggestions for how to do this?
@Riedler - Sure, hopefully this example helps.
size_dict = {
    '1/4"': '8mm - 1/4"',
    '1.5"': '40mm - 11/2"',
    '0.5"': '15mm - 1/2"',
}
Raw Data Input:
0.5" Pipe
1/4" Flange
1.5" x 0.5" Reducer
My company uses SAP with set item codes and item description formats so I am taking those three descriptions and putting them in our format. This:
def sizes():
    new_list = [x for x in mid_item_size_one if '"' in x]
    return new_list
With a step in between to split each description into a list, this is returned:
['0.5"']
['0.25"']
['1.5"', '0.5"']
From this point, I need to run these elements through my dictionary and get the value (key:value pair). My current for loop doesn't work and I'm not sure why or what I can alter to correct it.
The final result should be:
var_1 = 15mm - 1/2"
var_1 = 8mm - 1/4"
var_1 = 40mm - 11/2", var_2 = 15mm - 1/2"
If I understand correctly, your mistake is that you loop over the whole dictionary for every value in the list of sizes. Why?
The idea of a dictionary is that when you have a value that matches a key, you can look up the matching value directly.
Also, there is no need to call sizes twice to receive the same values. Save the result first and then reuse it.
def item_size_one_final(size_dict):
    sizes_lst = sizes()
    res = []
    for x in sizes_lst:
        if x in size_dict:
            res.append(size_dict[x])
    return res
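For a version you can run on its own, here is a minimal sketch that does the split, the filter, and the lookup in one function, using the size_dict and raw descriptions from the question. lookup_sizes is a hypothetical name, and silently skipping tokens that are not in the dictionary is an assumption about what you want.

```python
size_dict = {
    '1/4"': '8mm - 1/4"',
    '1.5"': '40mm - 11/2"',
    '0.5"': '15mm - 1/2"',
}


def lookup_sizes(description, size_dict):
    """Return the dictionary value for every inch-marked token that is a key."""
    # Split on whitespace and keep only tokens containing an inch mark
    tokens = [tok for tok in description.split() if '"' in tok]
    # Direct key lookup; tokens missing from the dictionary are skipped
    return [size_dict[tok] for tok in tokens if tok in size_dict]


for text in ['0.5" Pipe', '1/4" Flange', '1.5" x 0.5" Reducer']:
    print(lookup_sizes(text, size_dict))
```

Because the function returns a list, the number of values can vary per description, and you can join or unpack them as needed when building the final string.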

Look up a number inside a list within a pandas cell, and return corresponding string value from a second DF

(I've edited the first column name in the labels_df for clarity)
I have two DataFrames, train_df and labels_df. train_df has integers that map to attribute names in the labels_df. I would like to look up each number within a given train_df cell and return in the adjacent cell, the corresponding attribute name from the labels_df.
So for example, the first observation in train_df has attribute_ids of 147, 616 and 813, which map (in the labels_df) to culture::french, tag::dogs, tag::men. I would like to place those strings inside one cell on the same row as the corresponding integers.
I've tried variations of the function below but fear I am way off:
def my_mapping(df1, df2):
    tags = df1['attribute_ids']
    for i in tags.iteritems():
        df1['new_col'] = df2.iloc[i]
    return df1
The data are originally from two csv files:
train.csv
labels.csv
I tried this from @Danny:
sample_train_df['attribute_ids'].apply(
    lambda x: [sample_labels_df[sample_labels_df['attribute_name'] == i]['attribute_id_num'] for i in x])
*please note - I am running the above code on samples of each DF due to run times on the original DFs.
which returned:
I hope this is what you are looking for. I am sure there's a much more efficient way using a lookup.
df['new_col'] = df['attribute_ids'].apply(lambda x: [labels_df[labels_df['attribute_id'] == i]['attribute_name'] for i in x])
This is super ugly, and one day, hopefully sooner rather than later, I'll be able to accomplish this task in an elegant fashion. Until then, this is what got me the result I need.
split train_df['attribute_ids'] into their own cell/column
helper_df = train_df['attribute_ids'].str.split(expand=True)
combine train_df with the helper_df so I have the id column (they are photo id's)
train_df2 = pd.concat([train_df, helper_df], axis=1)
drop the original attribute_ids column
train_df2.drop(columns = 'attribute_ids', inplace=True)
rename the new columns
train_df2 = train_df2.rename(columns={0: 'attr1', 1: 'attr2', 2: 'attr3', 3: 'attr4', 4: 'attr5', 5: 'attr6',
                                      6: 'attr7', 7: 'attr8', 8: 'attr9', 9: 'attr10', 10: 'attr11'})
convert the labels_df into a dictionary
def create_file_mapping(df):
    mapping = dict()
    for i in range(len(df)):
        name, tags = df['attribute_id_num'][i], df['attribute_name'][i]
        mapping[str(name)] = tags
    return mapping
map and replace the tag numbers with their corresponding tag names
my_map = create_file_mapping(labels_df)
train_df3 = train_df2.applymap(lambda s: my_map.get(s) if s in my_map else s)
create a new column of the observations tags in a list of concatenated values
helper1['new_col'] = helper1[helper1.columns[0:10]].apply(lambda x: ','.join(x.astype(str)), axis = 1)
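A shorter route to the same result might look like the sketch below. It assumes attribute_ids holds space-separated ID strings (as the str.split call above suggests) and that the label columns are named attribute_id and attribute_name; the two-row sample frames and the id values are made up for illustration, reusing the 147/616/813 mapping described in the question.

```python
import pandas as pd

# Hypothetical sample frames shaped like the ones described in the question
train_df = pd.DataFrame({
    "id": ["photo_a", "photo_b"],
    "attribute_ids": ["147 616 813", "616"],
})
labels_df = pd.DataFrame({
    "attribute_id": [147, 616, 813],
    "attribute_name": ["culture::french", "tag::dogs", "tag::men"],
})

# Build the lookup once; keys are strings because the split IDs are strings
id_to_name = dict(zip(labels_df["attribute_id"].astype(str),
                      labels_df["attribute_name"]))

# Translate every ID in a cell; unknown IDs are left unchanged
train_df["attribute_names"] = train_df["attribute_ids"].apply(
    lambda ids: [id_to_name.get(i, i) for i in ids.split()]
)
print(train_df)
```

Building the dictionary once and doing plain key lookups avoids filtering labels_df for every single ID, which is likely what made the apply-with-boolean-mask approach slow on the full data.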

Comparing user input list with dictionary and printing out corresponding value

Starting out by saying this is for school and I'm still learning so I'm not looking for a direct solution.
What I want to do is take an input from a user (one word or more).
I then make it into a list.
I have my dictionary and the code that I'm posting is printing out the values correctly.
My question is how do I compare the characters in my list to the keys in the dictionary and then print only those values that correspond to the keys?
I have also read a ton of different questions regarding dictionaries but it was no help at all.
Example of output:
Word: wow
Output: 96669
user_word = input("Please enter a word: ")
user_listed = list(user_word)

def keypresses():
    my_dict = {'.':1, ',':11, '?':111, '!':1111, ':':11111, 'a':2, 'b':22, 'c':222, 'd':3, 'e':33, 'f':333, 'g':4, 'h':44,
               'i':444, 'j':5, 'k':55, 'l':555, 'm':6, 'n':66, 'o':666, 'p':7, 'q':77, 'r':777, 's':7777, 't':8, 'u':88,
               'v':888, 'w':9, 'x':99, 'y':999, 'z':9999, ' ':0}
    for key, value in my_dict.items():
        print(value)
I am not going to hand you code for the project, but I will definitely send you in the right direction.
In my view there are two parts to this: match each character to a key to get its value, and combine the numbers into one output.
For the first part, you can iterate character by character with a simple for loop:
for letter in 'string':
    print(letter)
This prints s, t, r, i, n, g, one letter per line. So you can use this to find the value for each key (each letter).
Then, you can get the value as a string (so that the numbers are joined rather than added mathematically), with something like:
letter = 'w'
value = my_dict[letter]
value_as_string = str(value)
Then combine all of this in a for loop and append each string to the previous ones to create the desired output.

String to dictionary word count and display

I have a homework question which asks:
Write a function print_word_counts(filename) that takes the name of a
file as a parameter and prints an alphabetically ordered list of all
words in the document converted to lower case plus their occurrence
counts (this is how many times each word appears in the file).
I am able to get an out-of-order set of each word with its occurrence count; however, when I sort it and put each word on a new line, the count disappears.
import re
def print_word_counts(filename):
    input_file = open(filename, 'r')
    source_string = input_file.read().lower()
    input_file.close()
    words = re.findall('[a-zA-Z]+', source_string)
    counts = {}
    for word in words:
        counts[word] = counts.get(word, 0) + 1
    sorted_count = sorted(counts)
    print("\n".join(sorted_count))
When I run this code I get:
a
aborigines
absence
absolutely
accept
after
and so on.
What I need is:
a: 4
aborigines: 1
absence: 1
absolutely: 1
accept: 1
after: 1
I'm not sure how to sort it and keep the values.
It's a homework question, so I can't give you the full answer, but here's enough to get you started. Your mistake is in this line
sorted_count = sorted(counts)
Firstly, a dictionary is not something you sort in place. Secondly, sorted(counts) iterates over the dictionary's keys only, so it returns a sorted list of the words and the counts are dropped.
You can just print the value of counts, or, if you really need them in sorted order, consider changing the dictionary items into a list, then sorting them.
lst = list(counts.items())
# sort and return lst
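To see why sorting the items keeps the counts, here is a small sketch on a made-up dictionary (not your file). Each item is a (word, count) tuple, and tuples sort by their first element first, so the words end up in alphabetical order with their counts still attached.

```python
# Toy data, unrelated to the homework file
counts = {"pear": 2, "apple": 5, "fig": 1}

# items() yields (word, count) pairs; sorted() orders them by word
pairs = sorted(counts.items())

# Format each pair on its own line
lines = [f"{word}: {n}" for word, n in pairs]
print("\n".join(lines))
```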
