Python, Evaluating parsed data against a predefined list - python-3.x

I have a large set of data which is being parsed by feedparser and is enumerated as a string. For instance:
import feedparser
d = feedparser.parse('somefeed.xml')
will give elements such as d.entries['title'], d.entries['url'], and other d.entries which are strings.
I want to compare these elements against a list I have defined to see if there is a match, but I am not thinking something through correctly.
Below is what I tried, but I got no output. Any help is appreciated.
for i in d.entries:
    my_list = ["Title One", "Title Two", "Etc"]
    if my_list in i['title'].split("-"):
        print(i)
If there is a match of parsed data and an element of my list, I want to print that element.
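No answer is recorded for this question, so the following is only a sketch of one possible approach. It assumes each entry can be read like a dict with a 'title' key (feedparser entries allow this) and that a title counts as a match when any of its "-"-separated parts, with surrounding whitespace removed, equals an item in my_list. The original test, my_list in i['title'].split("-"), asks whether the whole list is an element of the split result, which is never true. The entries list below is hypothetical stand-in data so the example runs without a feed, and matching_entries is a made-up helper name.

```python
my_list = ["Title One", "Title Two", "Etc"]


def matching_entries(entries, wanted):
    """Return entries whose title has a "-"-separated part found in wanted."""
    wanted = set(wanted)  # set membership is a fast exact-match test
    matches = []
    for entry in entries:
        # Split the title and strip whitespace around each part
        parts = [part.strip() for part in entry["title"].split("-")]
        if any(part in wanted for part in parts):
            matches.append(entry)
    return matches


# Hypothetical stand-in for d.entries
entries = [
    {"title": "Title One - Some Site"},
    {"title": "Unrelated - Some Site"},
    {"title": "Etc"},
]

for entry in matching_entries(entries, my_list):
    print(entry["title"])
```

With real data, d.entries would be passed in place of the stand-in list.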

Related

Generate a list of strings from another list using python random and eliminate duplicates

I have the following list:
original_list = [('Anger', 'Envy'), ('Anger', 'Exasperation'), ('Joy', 'Zest'), ('Sadness', 'Suffering'), ('Joy', 'Optimism'), ('Surprise', 'Surprise'), ('Love', 'Affection')]
I am trying to create a random list comprising the second elements of the tuples in the above list, using the random module, in such a way that a value appearing more than once as the first element is only considered once.
That is, the final list I am looking at, will be:
random_list = [Exasperation, Suffering, Optimism, Surprise, Affection]
So, in the new list random_list, the strings Envy and Zest are eliminated (as their first elements, Anger and Joy, appear in the original list twice). The process also has to randomize the result, i.e. each run should produce a different list of five elements.
Could somebody show me how to do this?
You can use a dictionary to filter the duplicates from original_list (shuffled beforehand with random.sample):
import random
original_list = [
("Anger", "Envy"),
("Anger", "Exasperation"),
("Joy", "Zest"),
("Sadness", "Suffering"),
("Joy", "Optimism"),
("Surprise", "Surprise"),
("Love", "Affection"),
]
out = list(dict(random.sample(original_list, len(original_list))).values())
print(out)
Prints (for example):
['Optimism', 'Envy', 'Surprise', 'Suffering', 'Affection']
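To make the one-liner easier to follow, here is the same idea spelled out step by step, as a sketch. The function name one_per_first_element is made up for illustration. It relies on the fact that assigning to an existing dictionary key overwrites the earlier value, so for each first element only the pair that comes last in the shuffled order survives.

```python
import random

original_list = [
    ("Anger", "Envy"),
    ("Anger", "Exasperation"),
    ("Joy", "Zest"),
    ("Sadness", "Suffering"),
    ("Joy", "Optimism"),
    ("Surprise", "Surprise"),
    ("Love", "Affection"),
]


def one_per_first_element(pairs):
    """Pick one second element at random for every distinct first element."""
    # random.sample with the full length returns a shuffled copy
    shuffled = random.sample(pairs, len(pairs))
    chosen = {}
    for first, second in shuffled:
        # A repeated key overwrites the value stored earlier
        chosen[first] = second
    return list(chosen.values())


print(one_per_first_element(original_list))
```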

Extracting string from lists of dictionaries (or generator)

I am scraping data with scrapetube to get the video IDs of all the videos from a YouTube channel. The scrape code returns a generator object, which I have converted to a list of dictionaries containing other dictionaries, lists and strings. The scraping code works, but here is some sample data anyway. I am only interested in the videoId string (see picture for illustration purposes).
How do I iterate through all the video IDs in the videoId string and save them in a new variable (list or dataframe) for further processing?
import scrapetube
vid = scrapetube.get_channel('UC_zxivooFdvF4uuBosUnJxQ')
type(vid) #generator file
video = next(vid) #extract values from generator & then convert it
videoL = list(vid) #convert it to a list
#code not working
for item in videoL['videoId']:
    entry = {}
    videoId = item['videoId']
    for i in range(len(videoId)):
        entry.append(int(videoId[i][0:10]))
#error message: TypeError: list indices must be integers or slices, not str
I used a code snippet from this post but can't seem to make it work.
It's helpful when you know the terminology so let's go through it step by step.
What is a generator?
A generator, as its name implies, generates values on demand.
Generators are useful here because, if you don't want to hold all the data in memory, you can iterate over one generated value at a time and extract only what you need.
Consider this:
def gen_one_million():
    for i in range(0, 1_000_000):
        yield i
for i in gen_one_million():
    pass  # do something with i
Rather than having a million elements in a list or some container in memory, you only get one at a time. If you want them all in a list it's very easy to do with list(gen_one_million()) but you're not tied to having them all in memory if you don't need them.
What is a list and how do I use them?
A list in Python is a container written with square brackets []. To access elements in a list you can index into it, e.g. i = my_list[0], or iterate over it:
for i in my_list:
    pass  # do something with i
What is a dict and how do I use them?
A dict is a Python key/value container type written with curly braces and a colon between the key and value: {key: value}
To access values in a dict you can reference the key whose value you want, e.g. i = my_dict[key], where key is a string or integer or some other hashable type. You can also iterate over it.
for key in my_dict:
    pass  # do something with the key
for value in my_dict.values():
    pass  # do something with the value
for key, value in my_dict.items():
    pass  # do something with the key and value
How does my case fit into all this?
Looking at your sample data it looks like you already have it converted from a generator to a list.
[
    {
        'videoId': '8vCvSmAIv1s',
        'thumbnail': {
            'thumbnails': [
                {
                    'url': 'https://i.ytimg.com/vi/8vCvSmAIv1s/hqdefault.jpg?sqp=-oaymwEbCKgBEF5IVfKriqkDDggBFQAAiEIYAXABwAEG&rs=AOn4CLDn3-yb8BvctGrMxqabxa_nH-UYzQ',
                    'width': 168,
                    'height': 94,
                },
                # etc..
            ]
        }
    }
]
However, since you just need to iterate over it and access the 'videoId' key in each generated dict, there's no reason to convert.
Just iterate directly over the generator and access the key of each generated dict.
video_ids = []
for item in vid:
    video_ids.append(item['videoId'])
Or even better, as a list comprehension:
video_ids = [item['videoId'] for item in vid]
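One thing to watch for, sketched here with a hypothetical stand-in generator (fake_channel is not part of scrapetube): a generator can only be consumed once. The next(vid) call in the question therefore removes the first video before list(vid) runs, and a second pass over the same generator yields nothing.

```python
def fake_channel():
    """Hypothetical stand-in for scrapetube.get_channel(...): one dict per video."""
    for video_id in ("aaa", "bbb", "ccc"):
        yield {"videoId": video_id}


vid = fake_channel()
first = next(vid)                          # consumes the first item
rest = [item["videoId"] for item in vid]   # only the remaining items are left
again = [item["videoId"] for item in vid]  # the generator is exhausted by now

print(first["videoId"], rest, again)
```

If every ID is needed, the generator should be created fresh and iterated exactly once, without a preceding next() call.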

adding data to a nested list of lists

Any help would be appreciated! I'm scraping multiple URLs and iterating over the URLs with a for loop. I'm putting relevant data into individual lists. However, I'm trying to organize my data in a list of lists to compare with other data that I haven't scraped yet. How do I iterate through the list of lists and put data into each element of the list? This doesn't seem that hard, so I don't know what I'm missing.
def get_info(item_urls):  # , count): count is being passed in, leaving this here for context
    for item in item_urls:
        # get data and stuff from current URL
        data = ["beer", "is", "awesome!", "...", "for", "helping", "with", "my", "depression"]
        count = len(data)  # counting data for a number, that I should have just made up :)
        table = [[] for i in range(0, count)]
        for truth in data:
            for i in range(0, count):
                list('table[{}]'.format(i)).append(truth)
            print(truth)
        for thing in table[0]:
            print(thing)
    return "borked"
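My fake logic:
for each element in data, append the element to table.
Once I iterate through all the URLs, I would like to return the entire built-out table.
myList[i] accesses the i-th inner list of a list of lists, and myList[i][j] accesses element j of that inner list. So to append to an inner list, index into the table directly with table[i].append(truth); list('table[{}]'.format(i)) only builds a throwaway list of characters from the string 'table[0]' and never touches table.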
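A minimal sketch of that indexing, assuming the goal is one inner list per position in data, filled across all URLs (the question leaves the exact layout open). build_table and the sample rows are hypothetical stand-ins for the scraping loop.

```python
def build_table(rows):
    """Collect the i-th value of every row into the i-th inner list.

    rows is a list of equal-length lists, one per scraped URL.
    """
    rows = list(rows)
    if not rows:
        return []
    # One empty inner list per column; created once, outside the loop
    table = [[] for _ in range(len(rows[0]))]
    for data in rows:
        for i, value in enumerate(data):
            table[i].append(value)  # index into the table directly
    return table


rows = [["beer", "is", "awesome!"], ["tea", "is", "fine"]]
print(build_table(rows))
```

Creating the table before the loop over URLs matters: in the question it is rebuilt on every iteration, so earlier results are lost.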

Loop json results

I'm totally new to python. I have this code:
import requests
won = 'https://api.pipedrive.com/v1/deals?status=won&start=0&api_token=xxxx'
json_data = requests.get(won).json()
deal_name = json_data ['data'][0]['title']
print(deal_name)
It prints the first title for me, but I would like it to loop through all titles in the json. But I can't figure out how. Can anyone guide me in the right direction?
You want to read up on dictionaries and lists. It seems like your json_data["data"] contains a list.
Seeing that you wrote this:
deal_name = json_data['data'][0]['title']
print(deal_name)
What you are looking for is:
for i in range(len(json_data["data"])):
    print(json_data["data"][i]["title"])
Print it with a for loop
1. for item in json_data['data']: will take each element in the list json_data['data']
2. Then we print the title property of the object using the line print(item['title'])
Code:
import requests
won = 'https://api.pipedrive.com/v1/deals?status=won&start=0&api_token=xxxx'
json_data = requests.get(won).json()
for item in json_data['data']:
    print(item['title'])
If you are OK with printing the titles as a list, you can use a list comprehension. Please refer to the link in the references to learn more.
print([x['title'] for x in json_data['data']])
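A slightly more defensive variant, as a sketch only: it assumes the response might come back without a usable 'data' list (for example when there are no results), which the question does not confirm. extract_titles is a hypothetical helper name, and the sample dicts stand in for the real response so the example runs without a network call.

```python
def extract_titles(json_data):
    """Return every title in json_data['data'], or [] if there is no data."""
    items = json_data.get("data") or []  # treats a missing key or None as empty
    return [item["title"] for item in items]


# Hypothetical stand-ins for requests.get(won).json()
sample = {"data": [{"title": "Deal A"}, {"title": "Deal B"}]}
empty = {"data": None}

print(extract_titles(sample))
print(extract_titles(empty))
```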
References:
Python Loops
Python Lists
Python Comprehensions

List, tuples or dictionary, differences and usage, How can I store info in python

I'm very new to Python (I usually write in PHP). I want to understand how to store information in an associative array, and it would be wonderful if you could explain the difference between "tuples", "arrays", "dictionaries" and "lists" (I tried to read different sources but I'm still not catching it).
So this is my code:
#!/usr/bin/python3.4
import csv
import string
nidless_keys = dict()
# this contains the strings to be searched in linesreader
nidless_keys = ['test_string1','test_string2']
data = {'type':[],'id':[]} # here I want to store my information
with open('path/to/csv/file.csv',newline="") as csvfile:
    linesreader = csv.reader(csvfile,delimiter=',',quotechar="|")
    # every line in this csv has a url like www.test.com/?test_string1&id=123456
    for row in linesreader:
        current_row_string = str(row)
        for needle in nidless_keys:
            current_needle = str(needle)
            if current_needle in current_row_string:
                # also I need to count, for every id, how many rows there are.
                data[current_needle][current_row_string[-8:]] += 1
In conclusion:
my_data_stored = [current_needle][current_row_string[-8]]
current_row_string is a URL, and the last 8 characters of the URL are an ID.
So the array should look like this at the end of the script:
test_string1 = 123456 = 20
= 256468 = 15
test_string2 = 123155 = 10
Edit 1:
Which type do I need here to store the information?
Can you tell me how to fix this script?
It seems you want to count how many times an ID in combination with a test string occurs.
There can be multiple ID/count combinations associated with every test string.
This suggests that you should use a dictionary indexed by the test strings to store the results. In that dictionary I would suggest storing collections.Counter objects.
With a plain dictionary, you would have to add a special case that inserts an empty Counter whenever a key isn't found in the results dictionary. This is a common problem, so there is a specialized form of dictionary in the collections module called defaultdict.
import collections
import csv
# Using a tuple for the keys so it cannot be accidentally modified
keys = ('test_string1', 'test_string2')
result = collections.defaultdict(collections.Counter)
with open('path/to/csv/file.csv',newline="") as csvfile:
    linesreader = csv.reader(csvfile,delimiter=',',quotechar="|")
    for row in linesreader:
        # csv.reader yields a list of fields; join them back into one string
        # so that the substring test and the slice work on the whole line.
        line = ','.join(row)
        for key in keys:
            if key in line:
                id = line[-6:] # ID's are six digits in your example.
                # The first index is into the dict, the second into the Counter.
                result[key][id] += 1
There is an even easier way, by using regular expressions.
Since you seem to treat every row in a CSV file as a string, there is little need to use the CSV reader, so I'll just read the whole file as text.
import re
with open('path/to/csv/file.csv') as datafile:
    text = datafile.read()
pattern = r'\?(.*)&id=(\d+)'
The pattern is a regular expression. This is a large topic in and of itself, so I'll only cover briefly what it does. (You might also want to check out the relevant HOWTO.) At first glance it looks like complete gibberish, but it is actually a complete language.
It looks for two things in a line: anything between ? and &id=, and a sequence of digits after &id=.
I'll be using IPython to give an example.
(If you don't know it, check out IPython. It is great for trying things and seeing if they work.)
In [1]: import re
In [2]: pattern = r'\?(.*)&id=(\d+)'
In [3]: text = """www.test.com/?test_string1&id=123456
....: www.test.com/?test_string1&id=123456
....: www.test.com/?test_string1&id=234567
....: www.test.com/?foo&id=234567
....: www.test.com/?foo&id=123456
....: www.test.com/?foo&id=1234
....: www.test.com/?foo&id=1234
....: www.test.com/?foo&id=1234"""
The text variable points to the string which is a mock-up for the contents of your CSV file.
I am assuming that:
every URL is on its own line
ID's are a sequence of digits.
If these assumptions are wrong, this won't work.
Use findall to extract every match of the pattern from the text.
In [4]: re.findall(pattern, text)
Out[4]:
[('test_string1', '123456'),
('test_string1', '123456'),
('test_string1', '234567'),
('foo', '234567'),
('foo', '123456'),
('foo', '1234'),
('foo', '1234'),
('foo', '1234')]
The findall function returns a list of 2-tuples (that is key, ID pairs). Now we just need to count those.
In [5]: import collections
In [6]: result = collections.defaultdict(collections.Counter)
In [7]: intermediate = re.findall(pattern, text)
Now we fill the result dict from the list of matches that is the intermediate result.
In [8]: for key, id in intermediate:
....:     result[key][id] += 1
....:
In [9]: print(result)
defaultdict(<class 'collections.Counter'>, {'foo': Counter({'1234': 3, '123456': 1, '234567': 1}), 'test_string1': Counter({'123456': 2, '234567': 1})})
So the complete code would be:
import collections
import re
with open('path/to/csv/file.csv') as datafile:
    text = datafile.read()
result = collections.defaultdict(collections.Counter)
pattern = r'\?(.*)&id=(\d+)'
intermediate = re.findall(pattern, text)
for key, id in intermediate:
    result[key][id] += 1
This approach has two advantages.
You don't have to know the keys in advance.
ID's are not limited to six digits.
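To get from the nested result to something like the layout asked for in the question (test string, then ID, then count), the dictionary of Counters can be walked with two loops. This is a sketch: format_counts is a made-up helper name, and the pairs list is a small stand-in for what findall would return.

```python
import collections


def format_counts(result):
    """Return one 'key = id = count' line per (key, id) combination."""
    lines = []
    for key, counter in result.items():
        # most_common() lists the IDs from highest to lowest count
        for id_, count in counter.most_common():
            lines.append(f"{key} = {id_} = {count}")
    return lines


# Hypothetical stand-in for re.findall(pattern, text)
pairs = [
    ("test_string1", "123456"),
    ("test_string1", "123456"),
    ("test_string1", "234567"),
    ("foo", "1234"),
]
result = collections.defaultdict(collections.Counter)
for key, id_ in pairs:
    result[key][id_] += 1

print("\n".join(format_counts(result)))
```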
A brief summary of the python data types you mentioned:
A dictionary is an associative array, aka hashtable.
A list is a sequence of values.
An array is essentially the same as a list, but limited to basic datatypes. My impression is that they only exist for performance reasons; I don't think I've ever used one. If performance is that critical to you, you probably don't want to use Python in the first place.
A tuple is a fixed-length sequence of values (whereas lists and arrays can grow).
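The four types from the summary can be compared side by side in a short sketch. The variable names are arbitrary, and "array" here means the standard library array module, which is an assumption about what the summary refers to.

```python
import array

# dict: associative array, values are looked up by key
counts = {"potato": 3}
counts["tomato"] = 5

# list: a sequence that can grow and can mix types
values = [1, "a", True]
values.append(2.5)

# tuple: a fixed-length sequence, there is no append method
point = (3, 4)

# array: like a list, but every item must have the declared basic type
numbers = array.array("i", [1, 2, 3])  # "i" means signed int
numbers.append(4)
try:
    numbers.append("five")  # wrong type for an int array
    rejected = False
except TypeError:
    rejected = True

print(counts, values, point, list(numbers), rejected)
```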
Let's take them one by one.
Lists:
A list is a very simple kind of data structure, similar to arrays in other languages in the way we write it:
['a','b','c']
This is a list in Python, but it looks very similar to an array.
However, there is a big difference between the way lists are used in Python and the usual arrays.
Lists are heterogeneous in nature. This means that we can store different kinds of data in them at the same time, like:
ls = [1,2,'a','g',True]
As you can see, we have various kinds of data within a list, and it is a valid list.
However, one important thing about them is that we can access the list items using zero-based indices. So we can write:
print(ls[0], ls[3])
output: 1 g
Dictionary:
This data structure is similar to a hash map. It contains (key, value) pairs. An empty dictionary looks like:
dc = {}
Now, to store key/value pairs, e.g. ('potato', 3) and ('tomato', 5), we can do:
dc['potato'] = 3
dc['tomato'] = 5
and we have saved the data in the dictionary dc.
The important thing is that we can even store another data structure, like a list, within a dictionary:
dc['list1'] = ls
where ls is the list defined above.
This shows the power of using a dictionary.
In your case, you have defined a dictionary like this:
data = {'type':[],'id':[]}
This means that your dictionary will consist of only two keys and each key corresponds to a list, which are empty for now.
Talking a bit about your script, the expression:
current_row_string[-8:]
doesn't make sense. For the six-digit IDs in your example, the index should have been -6 instead of -8; that would give you the ID part of the current row.
This part is the ID and should have been stored in a variable, say:
id = current_row_string[-6:]
Further action can be performed as seen in the answer given by Roland.
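A quick check of the slicing, as a sketch that assumes the row is the plain URL string from the comment in the question. The rsplit variant at the end is an alternative that does not appear in the answers; it is included because it does not depend on the ID length.

```python
url = "www.test.com/?test_string1&id=123456"

six = url[-6:]    # the last six characters: the ID in this example
eight = url[-8:]  # two characters too many: includes "d="

# Alternative: take everything after the last "&id=", whatever its length
id_by_split = url.rsplit("&id=", 1)[-1]

print(six, eight, id_by_split)
```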
