Any help would be appreciated! I'm scraping multiple URLs, iterating over them with a for loop and putting the relevant data into individual lists. However, I'm trying to organize my data in a list of lists so I can compare it with other data that I haven't scraped yet. How do I iterate through the list of lists and put data into each inner list? This doesn't seem that hard, but I don't know what I'm missing.
def get_info(item_urls):  # , count): count is being passed in, leaving this here for context
    for item in item_urls:
        # get data and stuff from current URL
        data = ["beer", "is", "awesome!", "...", "for", "helping", "with", "my", "depression"]
        count = len(data)  # counting data for a number, that I should have just made up :)
        table = [[] for i in range(0, count)]
        for truth in data:
            for i in range(0, count):
                list('table[{}]'.format(i)).append(truth)
                print(truth)
        for thing in table[0]:
            print(thing)
    return "borked"
My fake logic:
For each element in data, append the element to table.
Once I have iterated through all the URLs, I would like to return the entire built-out table.
myList[i] indexes into the outer list and gives you one inner list. myList[i][j] indexes an element inside that inner list, where j is the position within the inner list.
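Here is a minimal sketch of how that could look for the scraping case. It assumes every URL yields a `data` list of the same length and that the goal is to collect the i-th field of every URL into `table[i]`; `build_table`, `rows` and `scraped` are made-up names. Note that `list('table[{}]'.format(i))` in the question builds a throwaway list of the characters in the string "table[0]", so nothing is ever appended to the real `table`.

```python
def build_table(rows):
    """Collect the i-th field of every row into table[i].

    `rows` stands in for the per-URL `data` lists; the name is hypothetical.
    """
    rows = list(rows)
    if not rows:
        return []
    # One independent inner list per field.
    table = [[] for _ in range(len(rows[0]))]
    for data in rows:
        for i, value in enumerate(data):
            table[i].append(value)  # index the real list, not a string
    return table


scraped = [["beer", "is", "awesome!"], ["tea", "is", "fine"]]
table = build_table(scraped)
print(table)  # [['beer', 'tea'], ['is', 'is'], ['awesome!', 'fine']]
```

Creating the table once, outside the loop over URLs, is what lets it keep growing instead of being reset for every URL.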
I am scraping data with scrapetube to get the video IDs of all the videos from a YouTube channel. The scrape code returns a generator object, which I have converted to a list of dictionaries containing other dictionaries, lists and strings. The scraping code works, but here is some sample data anyway. I am only interested in the videoId string (see picture for illustration purposes).
How can I iterate through all the video IDs stored under videoId and save them in a new variable (a list or DataFrame) for further processing?
import scrapetube

vid = scrapetube.get_channel('UC_zxivooFdvF4uuBosUnJxQ')
type(vid)  # generator file
video = next(vid)  # extract values from generator & then convert it
videoL = list(vid)  # convert it to a list

# code not working
for item in videoL['videoId']:
    entry = {}
    videoId = item['videoId']
    for i in range(len(videoId)):
        entry.append(int(videoId[i][0:10]))
# error message: TypeError: list indices must be integers or slices, not str
I used a code snippet from this post but can't seem to make it work.
It's helpful when you know the terminology, so let's go through it step by step.
What is a generator?
A generator, like its name implies, generates values on demand.
Generators are useful here because, if you don't want to hold all the data in memory, you can iterate over one generated value at a time and extract only what you need.
Consider this:
def gen_one_million():
    for i in range(0, 1_000_000):
        yield i

for i in gen_one_million():
    pass  # do something with i
Rather than having a million elements in a list or some container in memory, you only get one at a time. If you want them all in a list it's very easy to do with list(gen_one_million()) but you're not tied to having them all in memory if you don't need them.
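One consequence worth seeing in a small example: a generator can only be walked once. The sketch below uses a made-up three-value generator (`gen_three` is a hypothetical name) to show that values taken with next() are gone by the time you call list(). In the question's code this means the video fetched with next(vid) is not part of videoL.

```python
def gen_three():
    """Hypothetical tiny generator used only for this illustration."""
    for i in range(3):
        yield i


numbers = gen_three()
first = next(numbers)   # pulls one value out of the generator
rest = list(numbers)    # collects only what is left
again = list(numbers)   # the generator is exhausted by now

print(first, rest, again)  # 0 [1, 2] []
```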
What is a list and how do I use them?
A list in Python is a container represented by brackets []. To access elements in a list you can index into it, e.g. i = my_list[0], or iterate over it:
for i in my_list:
    pass  # do something with i
What is a dict and how do I use them?
A dict is a Python key/value container type represented by curly braces, with a colon between each key and its value: {key: value}
To access values in a dict you can reference the key whose value you want, e.g. i = my_dict[key], where key is a string or integer or some other hashable type. You can also iterate over it:
for key in my_dict:
    pass  # do something with the key

for value in my_dict.values():
    pass  # do something with the value

for key, value in my_dict.items():
    pass  # do something with the key and value
How does my case fit into all this?
Looking at your sample data it looks like you already have it converted from a generator to a list.
[
    {
        'videoId': '8vCvSmAIv1s',
        'thumbnail': {
            'thumbnails': [
                {
                    'url': 'https://i.ytimg.com/vi/8vCvSmAIv1s/hqdefault.jpg?sqp=-oaymwEbCKgBEF5IVfKriqkDDggBFQAAiEIYAXABwAEG&rs=AOn4CLDn3-yb8BvctGrMxqabxa_nH-UYzQ',
                    'width': 168,
                    'height': 94
                },  # etc..
            ]
        }
    }
]
However, since you just need to iterate over it and access the 'videoId' key in each generated dict, there's no reason to convert it to a list first.
Just iterate directly over the generator and access the key of each generated dict.
video_ids = []
for item in vid:
    video_ids.append(item['videoId'])
Or even better, as a list comprehension.
video_ids = [item['videoId'] for item in vid]
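To try this without hitting YouTube, here is a small self-contained sketch. `fake_channel` is a made-up stand-in for the scrapetube generator, and its dicts are invented sample data shaped like the sample in the question. The `if "videoId" in item` filter is an extra precaution I've assumed might be useful, not something the real data is known to need. It also shows why the original loop failed: videoL is a list, so videoL['videoId'] tries to index a list with a string, which raises the TypeError from the question.

```python
def fake_channel():
    """Stand-in for the scrapetube generator; the dicts are made-up sample data."""
    yield {"videoId": "8vCvSmAIv1s", "thumbnail": {"thumbnails": []}}
    yield {"videoId": "hypothetical02", "thumbnail": {"thumbnails": []}}
    yield {"title": "entry without an id"}


# Each item is a dict, so the key lookup happens per item, not on the list.
video_ids = [item["videoId"] for item in fake_channel() if "videoId" in item]
print(video_ids)  # ['8vCvSmAIv1s', 'hypothetical02']
```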
I am currently trying to write a function to iterate through a nested list and check if one item from the list, 'team', is already in a separate list 'teams'.
If it is not, I want to append a nested list, 'player_values' with a different item from the original nested list that was examined, in the form of a new list in the nested list.
If it is, I want to append the nested list 'player_values' with the item from the original nested list, but I want to add it to the most recent list in the nested list 'player_values' instead of creating a new list.
Currently, my code looks like this :
def teams_and_games(list, player, idx):
    teams = []
    player_values = []
    x = 0
    y = -1
    for rows in list:
        if player == list[x][BD.player_id] and list[x][BD.team] not in teams:
            teams.append(list[x][BD.team])
            player_values.append([list[x][idx]])
            x += 1
            y += 1
        elif player == list[x][BD.player_id]:
            player_values[y].append(list[x][idx])
            x += 1
    return player_values, teams
However, when I run the code in my main, using
values, teams = teams_and_games(NiceRow, name, BD.games)
print(values)
print(teams)
It only prints empty lists. The fact that it prints empty lists shows that it is returning the correct variables, but I can't figure out why the code in the function is failing to add anything to the lists. I have tried switching the .append with a more simple list += statement, but the result has been the same so far.
Ideally, I would be getting a nested list, containing an amount of lists equal to the number of items added to the other 'teams' list, and the list of teams in the order they were added.
The data I am working with is a nested list pulled from a .csv file, which has been formatted slightly using the .strip() and .split() commands. Each number has been converted to an int, and strings left as they are. The .CSV file it is from has 19 columns and ~80,000 rows, with each column always being either a string or an int.
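One likely cause, assuming x += 1 sits inside the if and elif branches as the symptom suggests: x only advances when the current row belongs to the player. If the first row belongs to someone else, x stays at 0 for the whole loop, the same row is checked ~80,000 times, and both lists stay empty. Below is a minimal sketch that iterates over the rows directly instead of keeping manual counters. The `BD` indices and `sample_rows` here are made-up stand-ins for the real column constants and CSV data.

```python
from collections import namedtuple

# Hypothetical column indices standing in for the question's BD constants.
Columns = namedtuple("Columns", "player_id team games")
BD = Columns(player_id=0, team=1, games=2)


def teams_and_games(rows, player, idx):
    teams = []
    player_values = []
    for row in rows:
        if row[BD.player_id] != player:
            continue  # skip other players, but still move on to the next row
        if row[BD.team] not in teams:
            teams.append(row[BD.team])
            player_values.append([])  # start a new inner list for a new team
        player_values[-1].append(row[idx])  # add to the most recent inner list
    return player_values, teams


sample_rows = [
    ["other01", "NYA", 10],
    ["smith01", "BOS", 50],
    ["smith01", "BOS", 60],
    ["smith01", "CHN", 20],
]
values, teams = teams_and_games(sample_rows, "smith01", BD.games)
print(values)  # [[50, 60], [20]]
print(teams)   # ['BOS', 'CHN']
```

Using the loop variable also avoids shadowing the built-in name list, and player_values[-1] replaces the separate y counter.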
How can I sort data that is stored in a global list after it has been inserted from within a method, so that related elements are grouped together before they are stacked into another list? Or is it bad practice, and a needless complication, to store the data in a global list instead of in separate lists within each method and to sort it afterwards?
Below is an example of the scenario:
list = []
dictionary = {}

def MethodA():  # returns title
    # searches for corresponding data using beautifulsoup
    # adds data into dictionary
    # list.append(dictionary)
    # returns list
    pass

def MethodB():  # returns description
    # searches for corresponding data using beautifulsoup
    # adds data into dictionary
    # list.append(dictionary)
    # returns list
    pass
Example of wanted output:
MethodA(): [title]  # scrapes (text.title) data from the web
MethodB(): [description]  # scrapes (text.description) from the web
# print(list)
>>> list = [{title, description}, {title, description}, {title, description}, {title, description}]
Actual output:
MethodA(): [title]  # scrapes (text.title) data from the web
MethodB(): [description]  # scrapes (text.description) from the web
# print(list)
>>> list = [{title}, {title}, {description}, {description}]
I've seen a few examples, such as using NumPy and sorting the data in an array:
arraylist = np.array(list)
arraylist[:, 0]
# but I get 'too many indices for array'
# because I have too much data loading in, and some entries
# have no data and are replaced with `None`, so the indexes are unbalanced.
I'm trying to keep it as modular as possible. I've tried plain iteration, but it gets complicated because I have to nest more loops inside it.
I've tried NumPy and enumerate, but I don't understand how to go about it. Because the list is unbalanced, meaning that some values are returned as None, I get this error: all the input array dimensions except for the concatenation axis must match exactly
Example : ({'Toy Box','Has a toy inside'},{'Phone', None }, {'Crayons','Used for colouring'})
Update: code sample of MethodA
def MethodA(tableName, rowName, selectedLink):
    try:
        for table_tag in selectedLink.find_all(tableName, {'class': rowName}):
            topic_title = table_tag.find('a', href=True)
            if topic_title:
                def_dict1 = {
                    'Titles': topic_title.text.replace("\n", "")}
                global_list.append(def_dict1)
        return def_dict1
    except:
        def_dict1 = None
Assuming you have something of the form:
x = [{'a'}, {'a1'}, {'b'}, {'b1'}, {'c'}, {None}]
you can do:
dictionary = {list(k)[0]: list(v)[0] for k, v in zip(x[::2], x[1::2])}
or
dictionary = {k.pop(): v.pop() for k, v in zip(x[::2], x[1::2])}
The second method will empty the sets in x, because pop() removes the element it returns.
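Here is a runnable sketch of the pairing idea. It uses next(iter(s)) as a non-destructive way to read the single member of a set, which is my substitution rather than part of the answer above. The second half assumes a different layout, the one shown in the question's actual output (all titles first, then all descriptions, in equal numbers), and pairs the two halves of the list. `grouped` and `records` are made-up names and sample values.

```python
# Sample data in the alternating layout assumed above.
x = [{'a'}, {'a1'}, {'b'}, {'b1'}, {'c'}, {None}]

# next(iter(s)) reads the single member of a set without removing it.
pairs = {next(iter(k)): next(iter(v)) for k, v in zip(x[::2], x[1::2])}
print(pairs)  # {'a': 'a1', 'b': 'b1', 'c': None}

# If all titles come first and all descriptions second, pair the halves instead.
grouped = [{'Toy Box'}, {'Phone'}, {'Has a toy inside'}, {None}]
half = len(grouped) // 2
records = [
    {'title': next(iter(t)), 'description': next(iter(d))}
    for t, d in zip(grouped[:half], grouped[half:])
]
print(records)
```

A missing description simply stays None in its record, so the unbalanced data no longer needs a rectangular array.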
I have a dictionary that contains a few sub-dictionaries, each with many keys. A for loop with an if condition generates the results. I want to add all the results under the desired key, but my code only keeps the result of the last iteration of the loop, replacing the value from the previous iteration.
What I actually want is to print all the results.
for item in list1:  # item is a tuple & list1 has tuples in it
    if item == node_pair:  # node_pair is another tuple
        high_p[i]["links"] = link_name  # "links" is the key
desired output:
"links": [link_name1, link_name2, link_name3]
what i get:
"links" : link_name3
Please guide me..
So each sub-dictionary needs to have lists as values. You could pre-populate each sub-dictionary with lists ahead of time, but it's easier to create them on demand using setdefault.
for item in list1:
    if item == node_pair:
        high_p[i].setdefault("links", []).append(link_name)
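A self-contained sketch of the same idea, with made-up stand-ins for `high_p`, `i`, `node_pair` and the source of `link_name` (the question doesn't show where those come from, so `candidates` is purely hypothetical):

```python
# Hypothetical stand-ins for the question's variables.
high_p = {0: {"weight": 3}}
i = 0
node_pair = ("a", "b")
candidates = [
    (("a", "b"), "link_name1"),
    (("c", "d"), "skipped"),
    (("a", "b"), "link_name2"),
]

for item, link_name in candidates:
    if item == node_pair:
        # Creates the list on first use, then appends to the same list afterwards.
        high_p[i].setdefault("links", []).append(link_name)

print(high_p)  # {0: {'weight': 3, 'links': ['link_name1', 'link_name2']}}
```

Plain assignment with = rebinds the key each time, which is why only the last value survived in the original loop.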
I'm using scrapy to scrape stock premarket data. Here is the code being used to scrape the website:
def parse(self, response):
    for sel in response.xpath('//body'):
        item = PremarketItem()
        item['volume'] = sel.xpath('//td[@class="tdVolume"]/text()').extract()
        item['last_price'] = sel.xpath('//div[@class="lastPrice"]/text()')[:30].extract()
        item['percent_change'] = sel.xpath(
            '//div[@class="chgUp"]/text()')[:15].extract() + sel.xpath('//div[@class="chgDown"]/text()')[:15].extract()
        item['ticker'] = sel.xpath('//a[@class="symbol"]/text()')[:30].extract()
        yield item
The output of the code above into the .csv file is something along the lines of this:
ticker,percent_change,last_price,volume
"HTGM,SNCR,SAEX,IMMU,OLED,DAIO","27.43%,20.39%,17.28%,17.19%,15.69%","5,298350,700,1090000,76320,27190,13010",etc
As you can see, the values are separated correctly, but they're all stuck in massive strings. I've tried multiple for loops, but nothing has worked, and I can't find anything. Thank you for the help!
Instead of splitting the massive strings, you can fix the scrapy code so that the values are separated in the first place.
Your item XPaths start with //, which selects every matching element in the whole document regardless of the current selector, so all matches end up in one (massive) item. I suppose your target website has some structure around the target items, e.g. table rows.
You then need to figure out an XPath expression that matches the rows, and loop over those rows to parse one item per row. See the following pseudo code:
def parse(self, response):
    # Loop over table rows ...
    for sel in response.xpath('//table/tr'):
        item = PremarketItem()
        # Use an XPath relative to the table row: note the dot at the beginning
        item['volume'] = sel.xpath('./td[@class="tdVolume"]/text()').extract()
        # ... other fields ...
        yield item
See scrapy documentation for examples of relative XPath expressions.
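To see the one-item-per-row idea in isolation, here is a sketch that uses the standard library's xml.etree.ElementTree instead of scrapy. That is a deliberate swap so it runs without a spider or a network connection. The markup in `PAGE` is invented, since the real page structure is unknown, and ElementTree only supports a subset of XPath.

```python
import xml.etree.ElementTree as ET

# Made-up, well-formed markup; the real page structure is unknown.
PAGE = """
<table>
  <tr><td class="symbol">HTGM</td><td class="tdVolume">298350</td></tr>
  <tr><td class="symbol">SNCR</td><td class="tdVolume">700</td></tr>
</table>
""".strip()

table = ET.fromstring(PAGE)
items = []
for row in table.findall("./tr"):
    # Paths starting with "./" are resolved relative to the current row,
    # so each item only sees the cells of its own row.
    items.append({
        "ticker": row.find("./td[@class='symbol']").text,
        "volume": row.find("./td[@class='tdVolume']").text,
    })

print(items)
# [{'ticker': 'HTGM', 'volume': '298350'}, {'ticker': 'SNCR', 'volume': '700'}]
```

Each dict here plays the role of one yielded item, which is what produces one CSV line per ticker instead of one line holding every value.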