Earlier today I had to check an alarm on the servers. The alarm was for a non-issue, but I happened to notice a momentary delay to replication across multiple services and I wanted to investigate.
The rendering servers fell up to four minutes behind on some servers, with all servers falling at least a minute behind normal. Interestingly, there does not seem to be a correlation between the delay and the server capacity.
The vector tile servers had a more obvious impact with one server falling 12 minutes behind and the other 23 minutes. This increased time is by design. The mod_tile rendering stack has less work to do on updates but it causes more work on future requests for tiles in the updated area. On the other hand, the tilekiln stack has more work to do on updates but subsequent requests to the area are actually slightly faster.
I checked the nominatim lag. The delay is barely discernable so I’m not including the graph.
Checking the minutely replication around the time shows that at 23:13 the replication diffs jumped in size, with 007130004 to 007130010 showing an increased size of 400 kB to 1000 kB. The minutely diffs are normallly under 50 kB.
I examined one of the diffs to see what user edited the most objects. The easiest way was with osmium show 008.osc.gz | grep user | sort | uniq -c | sort -h. This technique isn’t 100% reliable since a tag containing the string “user” would introduce extra results but it’s good enough for a quick manual look.
Immediately it was obvious that this was a revert being done by woodpeck_repair as they had over 100x as many changes than the next user. Looking at the changesets it was a revert of a bad import.
I still wanted to explore how this showed up on the vector tile systems. The changes came in every minute for several minutes. The graph shows
1. Normal updates before the large diffs, with only 60 seconds between updates
2. update 1 took longer, meaning after it happened it was slightly behind
3. update 2 merged the multiple diffs available. The are the diffs that accumulated between where the green and purple lines intersect below 1 and the purple and yellow intersection to the right. It looks like it updated two minute of data (red or blue lines) in about 4 minutes (time between “valleys”)
4. update 3 took those 4 minutes of new data and took 8 minutes. Looking at the diff size this is when the biggest updates happened
5. update 4 consumed those 8 minutes of data and processed them in 5 minutes
6. update 5 consumed the new 5 minutes of data in about 30 seconds
7. updates continued like normal.
Both axis are measured in time and the blue/red lines are how long it took an update to run. If you plotted the data on a graph where the axises were the same scale all the lines would be at 90 or 45 degrees.
The graph is approximate since updates only come out every 60 seconds and the data is updated every 15 seconds. Also it’s made in a drawing program, not properly graphed.
Looking into the logs I can see more detail about what’s taking the time.
The first large diff was picked up at 23:14:08 and
start time
# ways
osm2pgsql
building tiles
expired
tilekiln time
backlog left
23:14:08
72056
72s
11391
17s
145s
23:15:40
161002
141s
12811
11s
234s
23:18:16
482221
426s
14158
14s
492s
23:25:40
277978
269s
13568
14s
357s
23:30:25
1225
1225
53
26s
0s
To my surprise, most of the time was spent in osm2pgsql processing ways, not in tilekiln. This means my strategies to minimize tilekiln processing are working.
Тональность 0
Информативность 0
www.openstreetmap.org