Some background to understand our issue better
We are working on a new website with a strong emphasis on images. We already have www.333travel.nl, which as a travel product relies heavily on images. Our new website, www.travelplatinum.nl, is aimed at the luxury traveller and loads even higher resolution images to support the luxury feel. The project that serves our images, static.333travel.nl, has been running for a few years with no reported issues for the 333travel website.
The issue
The issue was reported by our home workers. They noticed that when working from home, many of the images on TravelPlatinum were sometimes broken (and a refresh mostly fixed them). They also told us that everything worked fine from within the office. So we started a quest to find the culprit.
We noticed that in the browsers we use (Edge and Chrome) the console reported ERR_HTTP2_PROTOCOL_ERROR for the failed images. With that message in our pocket we started searching through the logs on the server. We are running Nginx as our webserver.
We quickly found a trail that appeared for every image that we thought didn’t load correctly. We could only guess at this point, since we weren’t able to reproduce the issue ourselves yet.
0.0.0.0 - - [06/Nov/2023:15:45:02 +0100] "GET /web-images/10/2023/653775730fc66/Jazan---Al-qahar-mountains-1200.webp HTTP/2.0" 206 183064 "https://www.travelplatinum.nl/" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/119.0.0.0 Safari/537.36 Edg/119.0.0.0"
Pay attention to the 206 status here. It means Partial Content: the server returned only part of the resource.
https://developer.mozilla.org/en-US/docs/Web/HTTP/Status/206
This was odd, since as far as we knew our webserver wasn’t set up for partial loading of images. We had never knowingly configured any range-related headers, which are required to be able to serve partial content.
Our attempt to fix the issue
So that was the first issue we dove into. I found out that Nginx by default sends the header Accept-Ranges: bytes, which tells clients they may request byte ranges. I saw no reason for our image server to answer range requests with partial content, so we disabled this with the following setting in our server block.
server {
    server_name static.333travel.nl;
    max_ranges 0;
}
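To confirm whether a server still honours range requests after a change like this, a small probe can help. The following is a minimal sketch in Python using only the standard library; the function names and the example path in the comment are assumptions made for illustration, not part of our setup. It requests the first 100 bytes of a file and reports whether a 206 or a full 200 came back.

```python
import urllib.request


def describe_range_support(status, accept_ranges):
    """Summarise how a server answered a request that carried a Range header."""
    advertised = (accept_ranges or "none").strip().lower()
    if status == 206:
        verdict = "partial content served"
    elif status == 200:
        verdict = "full body served, Range header ignored"
    else:
        verdict = f"unexpected status {status}"
    return f"{verdict} (Accept-Ranges: {advertised})"


def probe(url):
    """Ask for the first 100 bytes of `url` and describe the answer."""
    request = urllib.request.Request(url, headers={"Range": "bytes=0-99"})
    with urllib.request.urlopen(request, timeout=10) as response:
        return describe_range_support(
            response.status, response.headers.get("Accept-Ranges")
        )


# Usage (hypothetical path, substitute an image that exists on the server):
# print(probe("https://static.333travel.nl/web-images/example.jpeg"))
```

Note that urllib speaks HTTP/1.1, so this only checks range handling, not anything specific to HTTP/2.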
We asked the people experiencing this issue to report to us if it happened again, which sadly was quickly the case. The entry in our logging did change. Edge and Chrome kept spitting out the same useless ERR_HTTP2_PROTOCOL_ERROR in the console tab. In the timeline we also saw these images stop loading at a certain point.
0.0.0.0 - - [08/Nov/2023:14:22:27 +0100] "GET /web-images/06/2020/5ed7ad48d3d79/mexico-yucatan-chichen-itza-ik-kil-cenote.jpeg HTTP/2.0" 200 0 "https://vue-boss.333office.nl/" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/119.0.0.0 Safari/537.36 Edg/119.0.0.0"
We now get a logged body size of 0 bytes together with an HTTP 200 (OK) status. This is getting really strange. When inspected in the browser, the response header reports the right number of bytes. So why does the Nginx log state that it replied with a size of 0? This is where things got weird and we didn’t really know what to do next. We first had to be able to reproduce the issue ourselves so we could really start to troubleshoot.
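Spotting these entries by eye gets tedious, so a small filter can help. The sketch below assumes the access log uses nginx’s default "combined" format, which matches the lines shown here; the function name and the choice of what counts as worth a closer look (a 206, or a 200 with 0 bytes sent) are my own assumptions for illustration.

```python
import re

# nginx "combined" format:
# $remote_addr - $remote_user [$time_local] "$request" $status
# $body_bytes_sent "$http_referer" "$http_user_agent"
COMBINED = re.compile(
    r'(?P<addr>\S+) - (?P<user>\S+) \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<bytes>\d+) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)


def worth_a_look(line):
    """Return a short reason if the log line matches one of the two
    patterns seen in this investigation, otherwise None."""
    match = COMBINED.match(line)
    if match is None:
        return None  # not in combined format, skip it
    status = int(match["status"])
    sent = int(match["bytes"])
    if status == 206:
        return "partial content"
    if status == 200 and sent == 0:
        return "200 with empty body"
    return None


# Usage: print only the interesting lines of a log file.
# with open("access.log", encoding="utf-8", errors="replace") as log:
#     for line in log:
#         reason = worth_a_look(line)
#         if reason:
#             print(reason, "|", line.rstrip())
```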
I tried to simulate a fault with ab (Apache Benchmark). I started conservatively with 500 requests and 50 concurrent requests. The issue never triggered for me. I then moved upwards in small steps, ending at 10,000 requests with 700 concurrent requests, without ever triggering the reported issue. The server also appeared to handle the load fine. The image I requested there was around 500 KB.
The Facts
- Setting a slow profile in the network tab of the developer tools didn’t reproduce the issue for us.
- The error was still only reported by 2 users working from home.
- In the office they hadn’t had a single issue.
- Loading the site 333travel.nl didn’t show the issue at all. Every image worked there, also for the 2 people working from home.
- Testing on a 5G connection from all the various places we could think of didn’t reproduce the issue once for me.
- Using ab (Apache Benchmark) to do 10,000 requests with 700 concurrent requests never triggered the issue at hand.
There was still one difference, and that was the domain of the site loading the images. For 333travel the image requests were flagged as same-site, which of course is true since the domain is the same. So I added a cross-origin header to the static server.
server {
    server_name static.333travel.nl;
    max_ranges 0;
    add_header 'Access-Control-Allow-Origin' '*';
}
This way every site wanting to use images from our static.333travel.nl gets the same reply. It didn’t change anything about the reported issue.
We still had trouble reproducing the issue ourselves. We were pretty sure that it only got triggered on a somehow bad connection while visiting TravelPlatinum. Visiting 333travel, which served its images from the same static server, never showed the issue we were experiencing on TravelPlatinum.
Simulating a bad WiFi connection
We badly wanted to simulate a bad WiFi connection. I moved to a spot just outside our office where the WiFi was bad, and I finally started to see the issue for myself. Finally something to base our testing on. It was still odd that we needed a somehow bad or slow WiFi connection to reproduce this. You would expect that setting your profile to slow 3G in the browser’s developer tools would behave the same.
Now we were finally able to start testing from our development environment. We started to suspect the project itself of doing something weird, maybe something JavaScript related closing the connection. So we created a stripped-down version of the TravelPlatinum website to test without JavaScript.
The image loading issue was still there.
We decided to take the exact same HTML from 333travel.nl and test with that to create an identical environment, since 333travel was still going strong on every refresh.
The image loading issue was still there.
We badly wanted to be able to reproduce this even better, since our bad WiFi spot wasn’t at a desk: we were sitting on a flight of stairs debugging the issue from there. I decided to create a second WiFi hotspot in our office with the connection throttled to 3 Mbit/s and see where that would take us. This was it! We could now reproduce a lot of broken images on TravelPlatinum with every refresh. We were getting somewhere at least. Bear in mind that the 333travel website was still loading every image with no issues.
We moved away from the Nginx logs at this point, since we had found that nothing was going to be solved in that direction. Nginx still logged a 200 OK with 0 bytes sent. If that is fine for Nginx with the right headers in place, that log entry was not going to help us solve the issue.
Testing with Firefox
By the time I created this WiFi hotspot we were just about ready to test with nothing but an HTML image tag loaded in the project. This still broke our images. I then suggested firing up Firefox just to see if it behaved differently, since I suspected the problem might be related to how the browser responds to what is happening. Or at least it might output a different error, something we could focus our search for the solution on.
We saw that Firefox was indeed loading the images line by line on the 3 Mbit/s connection. But it suddenly stopped loading, leaving broken images in place which were about 50% loaded or not loaded at all. This was finally a change to see! Edge and Chrome still displayed no image at all and gave me a useless error message. I now finally started to realize that the static server was actually sending the images and the browser was receiving them, but the connection somehow got closed too soon.
The solution
And that was when I finally found the issue at hand: a directive I had set ages ago on a global level in Nginx during a performance tuning session.
send_timeout 2;
My understanding of send_timeout was that it keeps the connection open for X seconds while no data is being transferred. The default is 60 seconds and I had lowered it to 2 seconds so we would not have open connections doing nothing for a whole 60 seconds.
Why would that be the issue, since I clearly saw Firefox visually loading the image line by line? Data is constantly being sent then, right? So why would this setting be related to closing the connection? The Nginx documentation says the following.
Syntax: send_timeout time;
Default: send_timeout 60s;
Context: http, server, location
Sets a timeout for transmitting a response to the client. The timeout is set only between two successive write operations, not for the transmission of the whole response. If the client does not receive anything within this time, the connection is closed.
Well, it appears this does get triggered on slower connections. Nginx interprets the slow transfer as a stalled request and closes the whole connection, in our case the one to static.333travel.nl. So the partially loaded image and every image queued after it won’t get loaded, since the connection to the static server is closed.
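A rough back-of-the-envelope calculation makes this plausible. The sketch below is a simplified model and not a description of Nginx internals: it assumes Nginx cannot make its next write until some amount of already queued data has left over the slow link, and the queued sizes are hypothetical values picked for illustration. Only the 3 Mbit/s link speed and the 2 second timeout come from our situation.

```python
def drain_seconds(pending_bytes, link_mbit_per_s):
    """Seconds needed to push `pending_bytes` through a link of the given speed."""
    bits = pending_bytes * 8
    return bits / (link_mbit_per_s * 1_000_000)


SEND_TIMEOUT = 2  # seconds, the old global setting
LINK_SPEED = 3    # Mbit/s, the throttled hotspot

# Hypothetical amounts of queued data, not measured values.
for pending_kib in (256, 512, 1024, 2048):
    seconds = drain_seconds(pending_kib * 1024, LINK_SPEED)
    verdict = "exceeds" if seconds > SEND_TIMEOUT else "fits within"
    print(f"{pending_kib:>5} KiB pending: {seconds:.2f}s, "
          f"{verdict} send_timeout of {SEND_TIMEOUT}s")
```

Under these assumptions, anything beyond roughly 750 KB of pending data on a 3 Mbit/s link already takes longer than 2 seconds to drain, which would fit with a page full of large images being hit while a lighter page is not.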
Why simulating a slow connection in the developer tools won’t trigger this issue is still something I don’t know for sure. I have some theories: a bad WiFi connection also introduces more latency and lost packets that have to be resent. But I don’t know the real reason, and I don’t know how these browsers simulate a slow connection.
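One way to experiment with this from a desk might be a client that deliberately reads the response slowly, so the slowness is real at the socket level. The following is a minimal sketch under stated assumptions: the function name, chunk size and pause are arbitrary choices of mine, it uses Python’s urllib (which speaks HTTP/1.1, not HTTP/2), and I have not verified that it reproduces our exact issue.

```python
import io
import time


def read_slowly(stream, chunk_size=4096, pause=0.5):
    """Read `stream` to the end in small chunks, sleeping between reads.

    Returns the number of bytes received. If the server closes the
    connection early, this typically shows up as a count lower than
    the Content-Length, or as an exception raised by read().
    """
    received = 0
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        received += len(chunk)
        time.sleep(pause)
    return received


# Offline self-check with an in-memory stream and no pause.
print(read_slowly(io.BytesIO(b"x" * 10000), pause=0))

# Usage against a real server (hypothetical URL):
# import urllib.request
# with urllib.request.urlopen("https://static.333travel.nl/web-images/example.jpeg") as response:
#     expected = int(response.headers["Content-Length"])
#     print(read_slowly(response), "of", expected, "bytes")
```

Because operating systems buffer incoming data, a large file and a long pause are probably needed before the server side notices that the client is not keeping up.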
Setting send_timeout to 8 seconds, as below, solved our issue completely.
server {
    server_name static.333travel.nl;
    max_ranges 0;
    add_header 'Access-Control-Allow-Origin' '*';
    send_timeout 8;
}
Conclusion
We have learned a lot, and we solved the issue. The odd thing remains that all this time it only triggered on our TravelPlatinum project. We were obviously balancing on the limit with the send_timeout setting, and the heavier image load on TravelPlatinum pushed us over it.
Another important thing to note: setting a slow connection in your developer tools is not a true simulation of a slow connection. You can simulate all you want, but you need real-life cases to really measure whether everything is working fine.