Too much memory used when dealing with BeautifulSoup4 in Python 3 - python-3.x

I wrote a script which fetches the HTML content of a page and analyzes it.
Running this code in a loop over several URLs, I noticed that memory usage was growing far too much and too quickly.
Profiling and debugging the code with several tools, I noticed that the problem seems to come from the bit of code that uses BeautifulSoup4, or at least that is what I think.
Line Mem usage Increment Line Contents
59 40.5 MiB 40.5 MiB @profile
60 def crawl(self):
70 40.6 MiB 0.0 MiB self.add_url_to_crawl(self.base_url)
71
72 291.8 MiB 0.0 MiB for url in self.page_queue:
74 267.4 MiB 0.0 MiB if url in self.crawled_urls:
75 continue
76
77 267.4 MiB 0.0 MiB page = Page(url=url, base_domain=self.base_url, log=self.log)
78
79 267.4 MiB 0.0 MiB if page.parsed_url.netloc != page.base_domain.netloc:
80 continue
81
82 291.8 MiB 40.1 MiB page.analyze()
83
84 291.8 MiB 0.0 MiB self.content_hashes[page.content_hash].add(page.url)
94
95 # Add crawled page links into the queue
96 291.8 MiB 0.0 MiB for url in page.links:
97 291.8 MiB 0.0 MiB self.add_url_to_crawl(url)
98
100 291.8 MiB 0.0 MiB self.crawled_pages.append(page.getData())
101 291.8 MiB 0.0 MiB self.crawled_urls.add(page.url)
102
103 291.8 MiB 0.0 MiB mem = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
104 291.8 MiB 0.0 MiB print('Memory usage is: {0} KB'.format(mem))
Here is what line 104 prints on each iteration:
Memory usage is: 69216 KB
Memory usage is: 92092 KB
Memory usage is: 105796 KB
Memory usage is: 134704 KB
Memory usage is: 158604 KB
Memory usage is: 184068 KB
Memory usage is: 225324 KB
Memory usage is: 248708 KB
Memory usage is: 273780 KB
Memory usage is: 298768 KB
Using tracemalloc in the main file that calls all the modules and runs the crawl method above, I got the following list from tracemalloc.take_snapshot():
/usr/lib/python3.8/site-packages/bs4/element.py:744: size=23.3 MiB, count=210391, average=116 B
/usr/lib/python3.8/site-packages/bs4/builder/__init__.py:215: size=17.3 MiB, count=335036, average=54 B
/usr/lib/python3.8/site-packages/bs4/element.py:628: size=9028 KiB, count=132476, average=70 B
/usr/lib/python3.8/html/parser.py:327: size=7804 KiB, count=147140, average=54 B
/usr/lib/python3.8/site-packages/bs4/element.py:121: size=6727 KiB, count=132476, average=52 B
/usr/lib/python3.8/site-packages/bs4/element.py:117: size=6702 KiB, count=40848, average=168 B
/usr/lib/python3.8/html/parser.py:324: size=6285 KiB, count=85986, average=75 B
/usr/lib/python3.8/site-packages/bs4/element.py:772: size=5754 KiB, count=105215, average=56 B
/usr/lib/python3.8/html/parser.py:314: size=5334 KiB, count=105196, average=52 B
/usr/lib/python3.8/site-packages/bs4/__init__.py:587: size=4932 KiB, count=105197, average=48 B
Most of the files listed above are under the /bs4/ folder. Now, given that the page variable (line 82) is not stored anywhere, page.getData() returns a dictionary and page.url a string, why is BeautifulSoup taking up so much memory?
On line 72 you can see how the memory usage went from ~40 MB to ~291 MB over the 10 URLs the loop processed; that's a big jump considering that the data I'm actually keeping is a small dictionary and a string.
Do I have a problem with the garbage collector, or did I write something wrong?
I'm not very experienced with Python, so I hope the conclusions I've drawn from the profiling and debugging are correct.
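One thing I'm wondering is whether the BeautifulSoup tree built inside page.analyze() stays reachable after the loop body finishes, for example because the Page keeps the soup (or strings extracted from it, which are NavigableString objects holding a reference back to the whole tree) as attributes. A minimal sketch of what I mean, with the Page internals replaced by placeholders since they are not shown above:
import gc
from bs4 import BeautifulSoup

def analyze(raw_html):
    """Extract only plain built-in values and drop the parse tree afterwards."""
    soup = BeautifulSoup(raw_html, 'html.parser')

    # str() matters here: soup.title.string is a NavigableString, and a
    # NavigableString keeps the whole tree alive through its parent links.
    title = str(soup.title.string) if soup.title and soup.title.string else None
    links = [str(a['href']) for a in soup.find_all('a', href=True)]

    # Explicitly destroy the tree before the next page is parsed.
    soup.decompose()
    del soup
    gc.collect()  # usually unnecessary, but makes real leaks easier to spot while profiling

    return {'title': title, 'links': links}

print(analyze('<html><head><title>t</title></head><body><a href="/x">x</a></body></html>'))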

Related

Memory leak where CPython extension returns a 'PyList_New' instance to Python, which is never deallocated

I've been trying to debug a memory leak for a few days and I'm running out of ideas.
High-level: I've written a CPython extension that allows querying against binary data files, and it returns the results as a Python list of objects. Usage is similar to this pseudocode:
for config in configurations:
    s = Strategy(config)
    for date in alldates:
        data = extension.getData(date)
        # do analysis on 'data', capture/save statistics
I've used tracemalloc, memory_profiler, objgraph, sys.getrefcount, and gc.get_referrers to try to find the root cause, and these tools all point to this extension as the source of an exorbitant amount of memory (many gigs). For context, a single record in the binary file is 64 bytes and there are typically 390 records per day, so each date iteration works with roughly 24 KB. Now, there are many iterations happening (synchronously), but in each iteration the data is used as a local variable, so I expected each subsequent assignment to deallocate the previous object. The results from memory_profiler suggest otherwise...
Line # Mem usage Increment Occurences Line Contents
============================================================
86 33.7 MiB 33.7 MiB 1 @profile
87 def evaluate(self, date: int, filterConfidence: bool, limitToMaxPositions: bool, verbose: bool) -> None:
92 112.7 MiB 0.0 MiB 101 for symbol in self.symbols:
93 111.7 MiB 0.0 MiB 100 fromdate: int = TradingDays.getAdjacentDay(date, -(self.config.analysisPeriod - 1))
94 111.7 MiB 0.0 MiB 100 throughdate: int = date
95
96 111.7 MiB 0.0 MiB 100 maxtime: int = self.config.maxTimeToGain
97 111.7 MiB 0.0 MiB 100 target: float = self.config.profitTarget
98 111.7 MiB 0.0 MiB 100 islong: bool = self.config.isLongStrategy
99
100 111.7 MiB 0.8 MiB 100 avgtime: Optional[int] = FileStore.getAverageTime(symbol, maxtime, target, islong, fromdate, throughdate, verbose)
101 111.7 MiB 0.0 MiB 100 if avgtime is None:
102 110.7 MiB 0.0 MiB 11 continue
103
104 112.7 MiB 78.3 MiB 89 weightedModel: WeightedModel = self.testAverageTimes(symbol, avgtime, fromdate, throughdate)
105 112.7 MiB 0.0 MiB 89 if weightedModel is not None:
106 112.7 MiB 0.0 MiB 88 self.watchlist.append(weightedModel)
107 112.7 MiB 0.0 MiB 88 self.averageTimes[symbol] = avgtime
108
109 112.7 MiB 0.0 MiB 1 if verbose:
110 print('\nFull Evaluation Results')
111 print(self.getWatchlistTableString())
112
113 112.7 MiB 0.0 MiB 1 self.watchlist.sort(key=WeightedModel.sortKey, reverse=True)
114
115 112.7 MiB 0.0 MiB 1 if filterConfidence:
116 112.7 MiB 0.0 MiB 91 self.watchlist = [ m for m in self.watchlist if m.getConfidence() >= self.config.winRate ]
117
118 112.7 MiB 0.0 MiB 1 if limitToMaxPositions:
119 self.watchlist = self.watchlist[:self.config.maxPositions]
120
121 112.7 MiB 0.0 MiB 1 return
This is from the first iteration of the evaluate function (there are 30 iterations in total). Line 104 is where memory seems to accumulate. What's strange is that the weightedModel contains only basic statistics about the data queried, and that data is stored in a loop-local variable. I can't figure out why the memory is not cleaned up after each inner iteration.
I've tried to del the objects in question after an iteration completes, but it has no effect. The refcount does seem high for the containing objects, and gc.get_referrers shows an object as referring to itself (?).
I'm happy to provide additional information/code, but I've tried so many things at this point a braindump would be a complete mess :) I'm hoping someone with more experience might be able to help me focus my thought process.
Cheers!
Found it! The leak was one layer deeper, where the extension function builds an instance of a Python object.
This was the leaky version:
PyObject* obj = PyObject_CallObject(PRICEBAR_CLASS_DEF, args);
PyObject_SetAttrString(obj, "id", PyLong_FromLong(bar->id));
// a bunch of other attrs...
return obj;
This is the fixed version:
PyObject* obj = PyObject_CallObject(PRICEBAR_CLASS_DEF, args);
PyObject* id = PyLong_FromLong(bar->id);
// others...
PyObject_SetAttrString(obj, "id", id);
// others...
Py_DECREF(id);
// others...
return obj;
For some reason I had it in my head that PyLong_FromLong did NOT hand me a reference I own, but it does: it returns a new reference, and PyObject_SetAttrString adds its own reference rather than stealing mine. This is how I wound up with an extra reference for every bar object that was created.
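The ownership rule is easy to see from pure Python too: setting an attribute stores its own reference to the value instead of taking over the one you already hold, which is exactly why the C side has to Py_DECREF the object it got from PyLong_FromLong. An illustration (plain Python, just to show the counting, not the extension itself):
import sys

class PriceBar:
    pass

bar = PriceBar()
value = int("123456789012345")   # built at runtime so it is not a cached/interned constant

before = sys.getrefcount(value)  # includes the temporary reference held by getrefcount itself
bar.id = value                   # like PyObject_SetAttrString: the attribute dict takes its OWN reference
after = sys.getrefcount(value)

print(before, after)             # 'after' is one higher than 'before'
# In C, the reference returned by PyLong_FromLong is owned by the caller, and
# SetAttrString does not steal it, so the caller must release it with Py_DECREF.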

Memory usage of list vs generator seems almost the same. Why?

I have the following Python code. I am trying to understand Python generators. If my understanding is correct, print_list should take much more memory than print_generator. I am using memory_profiler to profile the two functions below.
from memory_profiler import profile
import logging

my_list = [i for i in range(100000)]
my_generator = (i for i in range(1000000))

@profile
def print_generator():
    try:
        while True:
            item = next(my_generator)
            logging.info(item)
    except StopIteration:
        pass
    finally:
        print('Printed all elements')

@profile
def print_list():
    for item in my_list:
        logging.info(item)
        pass

logging.basicConfig(filename='app.log', filemode='w', format='%(name)s - %(levelname)s - %(message)s')
print_list()
print_generator()
The result of the profiling is pasted below.
Memory usage for the generator.
Line # Mem usage Increment Occurences Line Contents
============================================================
10 23.0 MiB 23.0 MiB 1 @profile
11 def print_generator():
12 23.0 MiB 0.0 MiB 1 try:
13 23.0 MiB 0.0 MiB 1 while True:
14 23.0 MiB -26026.5 MiB 1000001 item = next(my_generator)
15 23.0 MiB -26026.5 MiB 1000000 logging.info(item)
16 23.0 MiB -0.1 MiB 1 except StopIteration:
17 23.0 MiB 0.0 MiB 1 pass
18 finally:
19 23.0 MiB 0.0 MiB 1 print('Printed all elements')
Memory usage for the list
Line # Mem usage Increment Occurences Line Contents
============================================================
22 23.0 MiB 23.0 MiB 1 @profile
23 def print_list():
24 23.0 MiB 0.0 MiB 100001 for item in my_list:
25 23.0 MiB 0.0 MiB 100000 logging.info(item)
26 23.0 MiB 0.0 MiB 100000 pass
The memory usage for the list and the generator seems almost identical.
So what am I missing here? Why isn't the generator using less memory?
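For what it's worth, I suspect part of the answer is that the list is built at import time, before either profiled function runs, so its cost is already inside the 23.0 MiB baseline, and inside both loops only one small integer is alive at a time. The difference does show up if I measure the container objects themselves rather than the loops (a quick check, separate from the profiled run):
import sys

my_list = [i for i in range(100000)]
my_generator = (i for i in range(1000000))

# The list object holds 100,000 element pointers (plus the int objects themselves);
# the generator object only stores its frame/state, however many items it will yield.
print('list object:      ', sys.getsizeof(my_list), 'bytes')
print('generator object: ', sys.getsizeof(my_generator), 'bytes')

# Rough total for the list including its elements (ignores sharing of small ints):
print('list + elements:  ', sys.getsizeof(my_list) + sum(sys.getsizeof(i) for i in my_list), 'bytes')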

How to reduce/free memory when using xarray datasets?

This is the output of the memory profiler of a function in my code, using xarray (v.0.16.1) datasets:
Line # Mem usage Increment Line Contents
================================================
139 94.195 MiB 94.195 MiB @profile
140 def getMaps(ncfile):
141 335.914 MiB 241.719 MiB myCMEMSdata = xr.open_dataset(ncfile).resample(time='3H').reduce(np.mean)
142
143 335.945 MiB 0.031 MiB plt.figure(figsize=(20.48, 10.24))
144
145 # projection, lat/lon extents and resolution of polygons to draw
146 # resolutions: c - crude, l - low, i - intermediate, h - high, f - full
147 336.809 MiB 0.863 MiB map = Basemap(projection='merc', llcrnrlon=-10.,
148 335.945 MiB 0.000 MiB llcrnrlat=30., urcrnrlon=36.5, urcrnrlat=46.)
149
150
151 339.773 MiB 2.965 MiB X, Y = np.meshgrid(myCMEMSdata.longitude.values,
152 336.809 MiB 0.000 MiB myCMEMSdata.latitude.values)
153 348.023 MiB 8.250 MiB x, y = map(X, Y)
154
155 # reduce arrows density (1 out of 15)
156 348.023 MiB 0.000 MiB yy = np.arange(0, y.shape[0], 15)
157 348.023 MiB 0.000 MiB xx = np.arange(0, x.shape[1], 15)
158 348.023 MiB 0.000 MiB points = np.meshgrid(yy,xx)
159
160 #cycle time to save maps
161 348.023 MiB 0.000 MiB i=0
162 742.566 MiB 0.000 MiB while i < myCMEMSdata.time.values.size:
163 742.566 MiB 305.996 MiB map.shadedrelief(scale=0.65)
164 #waves height
165 742.566 MiB 0.000 MiB waveH = myCMEMSdata.VHM0.values[i, :, :]
166 742.566 MiB 0.000 MiB my_cmap = plt.get_cmap('rainbow')
167 742.566 MiB 0.043 MiB map.pcolormesh(x, y, waveH, cmap=my_cmap, norm=matplotlib.colors.LogNorm(vmin=0.07, vmax=4.,clip=True))
168 # waves direction
169 742.566 MiB 0.000 MiB wDir = myCMEMSdata.VMDR.values[i, :, :]
170 742.566 MiB 0.242 MiB map.quiver(x[tuple(points)],y[tuple(points)],np.cos(np.deg2rad(270-wDir[tuple(points)])),np.sin(np.deg2rad(270-wDir[tuple(points)])),
171 742.566 MiB 0.000 MiB edgecolor='lightgray', minshaft=4, width=0.007, headwidth=3., headlength=4., linewidth=.5)
172 # save plot
173 742.566 MiB 0.000 MiB filename = pd.to_datetime(myCMEMSdata.time[i].values).strftime("%Y-%m-%d_%H")
174 742.566 MiB 0.086 MiB plt.show()
175 742.566 MiB 39.406 MiB plt.savefig(TEMPDIR+filename+".jpg", quality=75)
176 742.566 MiB 0.000 MiB plt.clf()
177 742.566 MiB 0.000 MiB del wDir
178 742.566 MiB 0.000 MiB del waveH
179 742.566 MiB 0.000 MiB i += 1
180
181 #out of loop
182 581.840 MiB 0.000 MiB plt.close("all")
183 581.840 MiB 0.000 MiB myCMEMSdata.close()
184 441.961 MiB 0.000 MiB del myCMEMSdata
As you can see, the allocated memory is not freed, and after many runs the program simply fails ("Killed") because it runs out of memory.
How can I free the memory allocated by the dataset? I have tried both dataset.close() and deleting the variable, with no success.
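To clarify, this is the kind of restructuring I have been considering (a rough sketch only: the Basemap plotting code is elided and the output filename is a placeholder): open the dataset in a with block so the file is really closed, and close each figure with plt.close(fig) instead of plt.clf().
import gc
import numpy as np
import xarray as xr
import matplotlib.pyplot as plt

def getMaps(ncfile):
    # 'with' guarantees the underlying NetCDF file handles are released when
    # the block exits; the reduced result is computed eagerly into memory.
    with xr.open_dataset(ncfile) as ds:
        myCMEMSdata = ds.resample(time='3H').reduce(np.mean)

    try:
        for i in range(myCMEMSdata.time.values.size):
            # One figure per frame: plt.clf() clears the axes but keeps the
            # Figure (and its caches) alive, while plt.close(fig) actually frees it.
            fig = plt.figure(figsize=(20.48, 10.24))
            # ... Basemap / pcolormesh / quiver code from the question ...
            fig.savefig("frame_{}.jpg".format(i))
            plt.close(fig)
    finally:
        myCMEMSdata.close()
        del myCMEMSdata
        gc.collect()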

Why is docker stats CPU Percentage greater than 100 times number of cores

I have an Azure VM with 2 cores. From my understanding, the CPU % returned by docker stats can be greater than 100% if multiple cores are used, so it should max out at 200% on this VM. However, I get results like this, with CPU % greater than 1000%:
CONTAINER CPU % MEM USAGE / LIMIT MEM % NET I/O BLOCK I/O PIDS
545d4c69028f 3.54% 94.39 MiB / 6.803 GiB 1.35% 3.36 MB / 1.442 MB 1.565 MB / 5.673 MB 6
008893e3f70c 625.00% 191.3 MiB / 6.803 GiB 2.75% 0 B / 0 B 0 B / 24.58 kB 35
f49c94dc4567 0.10% 46.85 MiB / 6.803 GiB 0.67% 2.614 MB / 5.01 MB 61.44 kB / 0 B 31
08415d81c355 0.00% 28.76 MiB / 6.803 GiB 0.41% 619.1 kB / 3.701 MB 0 B / 0 B 11
03f54d35a5f8 1.04% 136.5 MiB / 6.803 GiB 1.96% 83.94 MB / 7.721 MB 0 B / 0 B 22
f92faa7321d8 0.15% 19.29 MiB / 6.803 GiB 0.28% 552.5 kB / 758.6 kB 0 B / 2.798 MB 7
2f4a27cc3e44 0.07% 303.8 MiB / 6.803 GiB 4.36% 32.52 MB / 20.27 MB 2.195 MB / 0 B 11
ac96bc45044a 0.00% 19.34 MiB / 6.803 GiB 0.28% 37.28 kB / 12.76 kB 0 B / 3.633 MB 7
7c1a45e92f52 2.20% 356.9 MiB / 6.803 GiB 5.12% 86.36 MB / 156.2 MB 806.9 kB / 0 B 16
0bc4f319b721 14.98% 101.8 MiB / 6.803 GiB 1.46% 138.1 MB / 64.33 MB 0 B / 73.74 MB 75
66aa24598d27 2269.46% 1.269 GiB / 6.803 GiB 18.65% 1.102 GB / 256.4 MB 14.34 MB / 3.412 MB 50
I can verify there are only two cores:
$ grep -c ^processor /proc/cpuinfo
2
The output of lshw -short is also confusing to me:
H/W path Device Class Description
=====================================================
system Virtual Machine
/0 bus Virtual Machine
/0/0 memory 64KiB BIOS
/0/5 processor Intel(R) Xeon(R) CPU E5-2673 v3 @ 2.40GHz
/0/6 processor Xeon (None)
/0/7 processor (None)
/0/8 processor (None)
/0/9 processor (None)
/0/a processor (None)
/0/b processor (None)
/0/c processor (None)
/0/d processor (None)
/0/e processor (None)
/0/f processor (None)
/0/10 processor (None)
...
with well over 50 processors listed
For your first question, I would suggest you submit an issue on this page.
The output of lshw -short is also confusing to me:
If you omit the "-short" parameter, you will find that all of the "processor (None)" entries are in the DISABLED state.
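For reference, the docker client derives the CPU % roughly as (container CPU delta / host CPU delta) x number of CPUs x 100, so values above cores x 100% usually mean the CPU count it multiplies by is larger than the number of cores you expect (for example, the length of the per-CPU usage list). A sketch of that calculation from the stats API payload; the field names are from the Engine API, and this is illustrative rather than the exact client source:
def cpu_percent(stats):
    """Approximate the 'CPU %' column from one /containers/<id>/stats JSON sample."""
    cpu = stats["cpu_stats"]
    precpu = stats["precpu_stats"]

    cpu_delta = cpu["cpu_usage"]["total_usage"] - precpu["cpu_usage"]["total_usage"]
    system_delta = cpu["system_cpu_usage"] - precpu["system_cpu_usage"]

    # Older clients multiplied by len(percpu_usage); if that list is longer than
    # the number of cores you think you have, the result can exceed cores * 100.
    ncpus = len(cpu["cpu_usage"].get("percpu_usage") or []) or cpu.get("online_cpus", 1)

    if cpu_delta > 0 and system_delta > 0:
        return (cpu_delta / system_delta) * ncpus * 100.0
    return 0.0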

Where is memory missing in top?

Here is an output from top (sorted by %Mem):
Mem: 5796624k total, 4679932k used, 1116692k free, 317652k buffers
Swap: 0k total, 0k used, 0k free, 1734160k cached
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
13169 storm 20 0 3279m 344m 16m S 0.7 6.1 201:38.40 java
5463 storm 20 0 2694m 172m 14m S 0.0 3.0 72:38.49 java
5353 storm 20 0 2561m 155m 14m S 0.0 2.7 30:20.43 java
13102 app 20 0 3813m 80m 17m S 0.3 1.4 132:37.16 java
13147 storm 20 0 3876m 65m 16m S 0.0 1.2 23:21.73 java
3081 named 20 0 230m 16m 2652 S 0.0 0.3 1:22.81 named
29773 root 20 0 318m 10m 3576 S 0.0 0.2 5:59.41 logstash-forwar
5345 root 20 0 193m 10m 1552 S 0.0 0.2 12:24.21 supervisord
1048 root 20 0 249m 5200 1068 S 0.0 0.1 0:22.55 rsyslogd
21774 root 20 0 99968 3980 3032 S 0.0 0.1 0:00.00 sshd
3456 postfix 20 0 81108 3432 2556 S 0.0 0.1 0:02.83 qmgr
3453 root 20 0 80860 3416 2520 S 0.0 0.1 0:19.40 master
In GBs:
Mem: 5.8g total, 4.7g used, 1.1g free, 0.3g buffers
So free mem is 1.1 / 5.8 ~ 19%
Whereas if we add up the %MEM column, the used memory is about 6.1+3.0+2.7+1.4+1.2+0.3+... ~ 16%, which means free should be about 84%.
Why don't the numbers match (19% vs 84%)?
From the memory usage related lines in top:
Mem: 5796624k total, 4679932k used, 1116692k free, 317652k buffers
Swap: 0k total, 0k used, 0k free, 1734160k cached
Total memory equals the sum of used and free memory. Used, on the other hand, is the sum of "really used by applications", cached, and buffers. So, in your case it goes like this:
Mem = 5796624k = 4679932k + 1116692k;
"Really used by applications" = Used - (cached + buffers)
= 4679932k - (1734160k + 317652k )
= 2628120k.
So total memory is 5.8 GB and 2.6 GB is really used by applications. Since 1.1 GB is free, that means 5.8 GB - (1.1 GB + 2.6 GB) = 2.1 GB of memory is used for buffers and cache, which improves performance. The moment an application needs part of that cached memory, it is given back immediately. That's why your computation of free memory as a percentage of total memory doesn't match what you expected.
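You can do the same arithmetic directly from /proc/meminfo, which is where top reads these numbers; a small sketch:
def meminfo_kb():
    """Parse /proc/meminfo into a {field: kB} dictionary."""
    fields = {}
    with open('/proc/meminfo') as f:
        for line in f:
            key, rest = line.split(':', 1)
            fields[key] = int(rest.split()[0])  # values are reported in kB
    return fields

m = meminfo_kb()
used = m['MemTotal'] - m['MemFree']
really_used = used - (m['Buffers'] + m['Cached'])  # memory actually held by applications

print('total={}k used={}k free={}k buffers={}k cached={}k really_used={}k'.format(
    m['MemTotal'], used, m['MemFree'], m['Buffers'], m['Cached'], really_used))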
