I'm trying to compare (for performance) the use of either dataURIs compared to a large number of images. What I've done is setup two tests:
Regular Images (WPT)
Base64 (WPT)
Both pages are exactly the same, other than "how" these images/resources are being offered. I've ran a WebPageTest against each (noted above - WPT) and it looks the average load time for base64 is a lot faster -- but the cached view of regular view is faster. I've implemented HTML5 Boilerplate's .htaccess to make sure resources are properly gzipped, but as you can see I'm getting an F for base64 for not caching static resources (which I'm not sure if this is right or not). What I'm ultimately trying to figure out here is which is the better way to go (assuming let's say there'd be that many resources on a single page, for arguments sake). Some things I know:
The GET request for base64 is big
There's 1 resource for base64 compared to 300 some-odd for the regular (which is the bigger downer here... GET request or number of resources)? The thing to remember about the regular one is that there are only so many resources that can be loaded in parallel due to restrictions -- and for base64 - you're really only waiting until the HTML can be read - so nothing is technically loaded than the page itself.
Really appreciate any help - thanks!
For comparison I think you need to run a test with the images sharded across multiple hostnames.
Another option would be to sprite the images into logical sets.
If you're going to go down the BASE64 route, then perhaps you need to find a way to cache them on the client.
If these are the images you're planning on using then there's plenty of room for optimisation in them, for example: http://yhmags.com/profile-test/img_scaled15/interior-flooring.jpg
I converted this to a PNG and ran it through ImageOptim and it came out as 802 bytes (vs 1.7KB for the JPG)
I'd optimise the images and then re-run the tests, including one with multiple hostnames.
Related
i would like to implement a lambda in aws which receives as input pixel coordinates (x/y), retrieve that pixel's RGB from one image, and then do something with it.
the catch now is that the image is very large: 21600x10800 pixels (a 684MB tif file).
Many of the image's pixels will likely never be accessed (its a world map so it includes e.g. oceans, for which no lambda calls will happen. But i don't know which pixels will be needed.)
The result of the lambda will be persisted so that the image operation is only done once per pixel.
My main concern is that i would like to avoid large unnecessary processing time and costs. I expect multiple calls per second of the lambda. The naive way would be to throw the image into an s3 bucket, then read it in the lambda to get one pixel - but i would think that then each lambda invoke would become very heavy. I could do some custom solution such as storing the rows separately but was wondering if there is some set of technologies that handles it more elegant.
Right now i am using Node.js 14.x but that's not a strong requirement.
the image is in tif format but i could convert it to another image format beforehand if needed. (just not to the answer of the lambda as that is even bigger)
How can i efficiently design this lambda?
As I said in the comments, I think Lambda is the wrong solution unless your traffic is very bursty. If you have continuous traffic with "multiple calls per second," it will be more cost-effective to use an alternate technology, such as EC2 or ECS. And these give you far more control over storage.
However, if you're set on using Lambda, then I think the best solution is to store the file on an EFS volume, then mount that filesystem onto your Lambda. In my experiments, it takes roughly 150 ms for a Lambda to read an arbitrary byte from a file on EFS, using Python and the mmap package.
Of course, if your TIFF library attempts to read the file into memory before performing any operations, this is moot. The TIFF format is designed so that shouldn't be necessary, but some libraries take the easy way out (because in most cases you're displaying or transforming the entire image). You may need to pre-process your file, to produce a raw byte format, in order to be able to make single-pixel gets.
Thanks everyone for the useful information!
so after some testing i settled for the solution from luk2302's comment with 100x100 pixel sub-images hosted on s3, but can't flag a comment as the solution. My tests showed that the lambda operates within 110ms to access a pixel (from the now only 4kb large files) which i think is quite sufficient for my expectations. (the very first request was at 1s time, but now even requests to sub-images which have never been touched before are answered within 110ms)
Parsifal's solution is what i originally envisioned to be ideal in order to really only check the relevant data (open question then being which image library actually omits loading the entire file) but i don't have the means to check the file system aspect more closely if that has more potential. In my case indeed the requests are very much burst driven (with long periods of expected inactivity after the bursts), so for the moment i will remain with a lambda but will keep the mentioned alternatives in mind.
I am writing a python program that uses beautifulsoup to scrape the image link off a website and then categorize the image. The website puts their images on separate pages in the given url format:
(website.com/(a-z)(a-z)(0-9)(0-9)(0-9)(0-9)
This means the the number of url possibilities are very high (+1 million). I am afraid that if I do a get request to the site this many times, it might harm the site or put me in legal danger. How can I scrape the most amount of urls without damaging the site or putting myself in legal trouble? Please let me know if you guys would want anymore information. Thank you!
P.S. I have left a psudocode of what my code does below if that helps.
P.S.S. Sorry if the format is weird or messed up, I am posting from mobile
For url in urlPossibilities:
Request.get(url)
UrlLink = FindImgLink(url)
Categorize(urlLink)
A few options I can think of...
1) Is there a way to get a listing of these image URLs? E.g. a site map, or a page with a large list of them. This would be the preferred way as by using that listing you can then only scrape what you know to exist. Based on your question I feel this is unlikely but if you have one URL is there no way to work backwards and find more?
2) Is there a pattern to the image naming? The letters might be random but the numbers might incrementally count up. E.g. AA0001 and AA0002 might exist but there may be no other images for the AA prefix?
3) Responsible scraping - if the naming within that structure truly is random and you have no option but try all URLs till you get a hit do so responsibly. Respect robot.txt's and limit the rate of requests.
In our company, we have to deal with a lot of user uploads, for example images and videos. Now I was wondering: how do you guys "deal with that" in terms of safety? Is it possible for an image to contain malicious content? Of course, there are the "unwanted" pixels, like porn or something. But that's not what I mean now. I mean images which "break" machines while they are being decoded, etc. I already saw this: How can a virus exist in an image.
Basically I was planning to do this:
Create a DMZ
Store the assets in a bucket (we use GCP here) which lives inside the DMZ
Then apply "malicious code"-detection on the file
If it turns out to be fine... then move the asset into the "real" landscape (the non-dmz)
Now the 3rd part... what can I do here?
Applying a virus scanner
No problem with this, there are a lot of options here. Simple approach and good chance that viruses are being caught.
Do mime-type detection
Based on the first few bytes, I do a mime type detection. For example, if someone sends us a "image.jpg" but in fact its an executable, then we would detect this. Right? Is this safe enough? I was thinking about this package.
What else???
Now... what else can I do? How do other big companies do this? I'm not really looking for answers in terms of orchestration, etc. I know how to use a DMZ, link it all together with a few pubsub topics, etc. I'm purely interested in what techniques to apply to really find out that an incoming asset is "safe".
What I would suggest is to not to do it outside the DMZ , let this be within your DMZ and it should have all the regular security controls as any other system will have within your data center.
Besides the things ( Virus Scan , Mime - Type detection ) that you have outlined , i would suggest a few additional checks to perform.
Size Limitation - You would not want anyone to just bloat out all the space and choke your server.
Throttling - Again you may want to control the throughput , at least have the ability to limit to some maximum value.
Heuristic Scan - Perhaps add a layer to the Anti Virus to do heuristics as well rather than simple signature scans.
File System Access Control - Make sure that the file system access control is foolproof , even in case something malicious comes in it should be able to propagate out to other folders / paths .
Network control - Make sure all the outbound connections are fire walled as well , just in case anything tries to make outward connections.
I am learning directx. It provides a huge amount of freedom in how to do things, but presumably different stategies perform differently and it provides little guidance as to what well performing usage patterns might be.
When using directx is it typical to have to swap in a bunch of new data multiple times on each render?
The most obvious, and probably really inefficient, way to use it would be like this.
Stragety 1
On every single render
Load everything for model 0 (textures included) and render it (IASetVertexBuffers, VSSetShader, PSSetShader, PSSetShaderResources, PSSetConstantBuffers, VSSetConstantBuffers, Draw)
Load everything for model 1 (textures included) and render it (IASetVertexBuffers, VSSetShader, PSSetShader, PSSetShaderResources, PSSetConstantBuffers, VSSetConstantBuffers, Draw)
etc...
I am guessing you can make this more efficient partly if the biggest things to load are given dedicated slots, e.g. if the texture for model 0 is really complicated, don't reload it on each step, just load it into slot 1 and leave it there. Of course since I'm not sure how many registers there are certain to be of each type in DX11 this is complicated (can anyone point to docuemntation on that?)
Stragety 2
Choose some texture slots for loading and others for perpetual storage of your most complex textures.
Once only
Load most complicated models, shaders and textures into slots dedicated for perpetual storage
On every single render
Load everything not already present for model 0 using slots you set aside for loading and render it (IASetVertexBuffers, VSSetShader, PSSetShader, PSSetShaderResources, PSSetConstantBuffers, VSSetConstantBuffers, Draw)
Load everything not already present for model 1 using slots you set aside for loading and render it (IASetVertexBuffers, VSSetShader, PSSetShader, PSSetShaderResources, PSSetConstantBuffers, VSSetConstantBuffers, Draw)
etc...
Strategy 3
I have no idea, but the above are probably all wrong because I am really new at this.
What are the standard strategies for efficient rendering on directx (specifically DX11) to make it as efficient as possible?
DirectX manages the resources for you and tries to keep them in video memory as long as it can to optimize performance, but can only do so up to the limit of video memory in the card. There is also overhead in every state change even if the resource is still in video memory.
A general strategy for optimizing this is to minimize the number of state changes during the rendering pass. Commonly this means drawing all polygons that use the same texture in a batch, and all objects using the same vertex buffers in a batch. So generally you would try to draw as many primitives as you can before changing the state to draw more primitives
This often will make the rendering code a little more complicated and harder to maintain, so you will want to do some profiling to determine how much optimization you are willing to do.
Generally you will get better performance increases through more general algorithm changes beyond the scope of this question. Some examples would be reducing polygon counts for distant objects and occlusion queries. A popular true phrase is "the fastest polygons are the ones you don't draw". Here are a couple of quick links:
http://msdn.microsoft.com/en-us/library/bb147263%28v=vs.85%29.aspx
http://www.gamasutra.com/view/feature/3243/optimizing_direct3d_applications_.php
http://http.developer.nvidia.com/GPUGems2/gpugems2_chapter06.html
Other answers are better answers to the question per se, but by far the most relevant thing I found since asking was this discussion on gamedev.net in which some big title games are profiled for state changes and draw calls.
What comes out of it is that big name games don't appear to actually worry too much about this, i.e. it can take significant time to write code that addresses this sort of issue and the time it takes to spend writing code fussing with it probably isn't worth the time lost getting your application finished.
I need to develop an IFilter for Microsoft Search Server 2008 that performs prolonged computations to extract text. Extracting text from one file can take from 5 seconds to 12 hours. How can I desing such an IFilter so that the daemon doesn't reset it on timeout and also other IFilters can be reset on timeout if they hang up?
12 hours, wow!
If it takes that long and there are many files, your best option would be to create a pre-processing application that would extract the text and make it available for the iFilter to access.
Another option would be to create html summaries of the documents and instruct the crawler to index those. If the summary page could easily link to the document itself if necessary.
I have not actually developed any filters yet, so I'm basically just guessing, but the way I always understood things is that the IFilter is chunk-based for exactly this reason. It's up to the filter implementation to make sure the returned chunks are "small enough", so the calling search daemon can simply quit in between two chunks if things are taking too long.
Apparently, my assumption is wrong, or you would not be asking this very question.