I wanted to share a little "war story" from today. It's real!
The guest post by Sam
and the corresponding Reddit post
are drawing quite a lot of traffic!! Which is great ;)
E.g. in the attached screenshot you can see 50 users online
across the Autobahn/WAMP/Crossbar/Tavendo web sites.
You can see their activity live and in real time (down to who moves their mouse on a single Web page ;) .. that is on Clandeck, an app based on Crossbar:
Clandeck, as well as the Crossbar demos, runs on a single Crossbar instance.
You can see CPU load at some 7%. No problem.
This Crossbar instance is running on an EC2 medium instance. Low end gear.
The static content for all those sites is on Amazon S3. Practically unbreakable. No matter what. And should traffic spike, we could easily
activate Amazon CloudFront. Anyway.
BUT: what happened just an hour ago was this: freeze. Crossbar demos. Clandeck. Nothing works anymore.
We have Pingdom set up, so I got an email alert right away (well, with a 5 min delay).
Logging into the host via SSH doesn't work anymore. ;( Shit.
Alright. Hard reboot via EC2 console.
Now have a look at the log file of Crossbar:
2014-06-05 19:08:57+0000 [Router 21392] Fatal Python error: GC object already tracked
.. server outage .. pingdom alert .. hard reboot ..
2014-06-05 19:33:40+0000 [Controller 1054] Log opened.
2014-06-05 19:33:40+0000 [Controller 1054] ============================== Crossbar.io ==============================
As you can see from
"Fatal Python error: GC object already tracked"
Python (CPython 2.7.7) crashed altogether!! First time I saw this. It's rare.
Thing is: pure Python code should not be able to crash the interpreter like this. If it ever does, that's a bug in Python itself. Which means: Crossbar's Python code isn't at fault.
However: Googling turns up that it's highly likely some Python C extension is doing bad things. Writing Python C extensions correctly is quite hard.
Now what? How to find and isolate the bad guy?
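One generic way to narrow this down is to make the dying process say where it was when it died. What follows is only a sketch, not what I did on the production box: it assumes Python 3 (not the 2.7.7 from the log above), where the `faulthandler` module is part of the standard library, and the crashing snippet with its `handle_request` function is a made-up stand-in that uses `os.abort()` to simulate a fatal error coming from native code.

```python
import subprocess
import sys

# Hypothetical stand-in for a component that dies inside native code.
CRASHING_CODE = """
import os

def handle_request():
    os.abort()  # simulate a fatal error raised from C code

handle_request()
"""


def run_with_faulthandler(code):
    """Run `code` in a child interpreter with faulthandler enabled.

    Returns (exit code, captured stderr). With faulthandler on, the
    child dumps the Python traceback to stderr before it dies.
    """
    proc = subprocess.run(
        [sys.executable, "-X", "faulthandler", "-c", code],
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        universal_newlines=True,
    )
    return proc.returncode, proc.stderr


if __name__ == "__main__":
    exit_code, stderr_text = run_with_faulthandler(CRASHING_CODE)
    print("exit code:", exit_code)
    print(stderr_text)
```

The traceback names the Python function that was running when the process aborted, which at least tells you which component called into the misbehaving native code.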
The Crossbar demos basically only need a Router.
But the Clandeck backend consists of a whole bunch of WAMP components (all Python). And those use Python C-Extensions - e.g. sqlite.
FWIW: Crossbar itself also uses sqlite (to store authentication cookies).
And I will remove that!! Since I have a feeling that sqlite and/or its Python C extension might be the bad guy (https://github.com/crossbario/crossbar/issues/64).
Sidenote: do NOT write Python C extensions. You will likely get it wrong. Use cffi.
But here is what Crossbar allowed me to do now, without changing any code, just the config:
Isolate the different parts into different processes!
Which means: should one of the guys freeze, it doesn't take down everything. I can shoot the single bad process. No problem. Crossbar allows me to restart workers dynamically.
For me that's a real world proof of the viability and usefulness of a multi-process architecture (which Crossbar implements).
Isolate things. Tame each of them on its own. Etc.
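To make the isolation idea concrete, here is a tiny sketch in plain Python. It is NOT how Crossbar implements its workers and it uses no Crossbar API. It assumes Python 3.5+, the component names and bodies are made up, and for brevity it runs the components one after another. Each component runs in its own interpreter process, so a hard crash in one (again simulated with `os.abort()`) neither kills the supervisor nor the other component, and only the crashed one gets restarted.

```python
import subprocess
import sys

# Hypothetical components: name -> Python source run in its own process.
COMPONENTS = {
    "router": "print('router ok')",
    "backend": "import os; os.abort()",  # simulates a hard interpreter crash
}


def run_isolated(code):
    """Run `code` in a fresh interpreter process and return its exit code."""
    proc = subprocess.run(
        [sys.executable, "-c", code],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return proc.returncode


def supervise(components, max_restarts=2):
    """Run every component in its own process.

    A component that exits with a non-zero code is restarted up to
    `max_restarts` times. Returns a dict: name -> list of exit codes.
    """
    history = {}
    for name, code in components.items():
        exit_codes = []
        for _ in range(max_restarts + 1):
            exit_codes.append(run_isolated(code))
            if exit_codes[-1] == 0:
                break  # clean exit, no restart needed
        history[name] = exit_codes
    return history


if __name__ == "__main__":
    for name, exit_codes in supervise(COMPONENTS).items():
        print(name, exit_codes)
```

The "router" exits cleanly on the first try, while the "backend" crashes on every attempt and is the only thing that gets restarted. The supervisor itself keeps running the whole time.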
Compare the first screenshot, where there is 1 process, "Clandeck allinone",
with the 2nd screenshot, where you can see 6 processes.
Again: I did not change 1 line of code. Only the config.
For those interested, I attached both config files.
config_prod_full_allinone.json (8.07 KB)
config_prod_full.json (9.87 KB)