Python pubsub stress test client

#1

Hi Tobias et al,

I’ve written a proof-of-concept stress test client and server that use Autobahn’s PubSub (https://github.com/nickfishman/autobahn-tests). The goal was to test how many distinct clients and messages per second a single Autobahn instance can handle (and to write some more interesting Autobahn code, which has been very fun). There are several components:

  • server.py: a basic autobahn PubSub server, straight from the examples
  • stressclient.py: the interesting stuff, with two kinds of clients (sketched below):
      • SenderClient: publishes a bunch of messages to a topic
      • MonitorClient: subscribes to a topic, receives all the messages sent by the senders, and outputs statistics as those messages arrive
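
For anyone who doesn't want to open the repo, here is a rough sketch of what the two clients boil down to (simplified and untested here; the topic URI, payload, and counts are placeholders, not the actual defaults):

from twisted.internet import reactor
from autobahn.websocket import connectWS
from autobahn.wamp import WampClientFactory, WampClientProtocol

TOPIC = "http://autobahn-pubsub/channels/1/stress"  # placeholder topic URI

class SenderClient(WampClientProtocol):
    """Publishes a fixed number of messages to TOPIC, then disconnects."""
    def onSessionOpen(self):
        for seq in xrange(100):            # --num_messages in the real client
            self.publish(TOPIC, {"seq": seq})
        self.sendClose()

class MonitorClient(WampClientProtocol):
    """Subscribes to TOPIC and counts every event the senders publish."""
    def onSessionOpen(self):
        self.received = 0
        self.subscribe(TOPIC, self.onEvent)

    def onEvent(self, topicUri, event):
        self.received += 1                 # feeds the statistics output

monitor = WampClientFactory("ws://localhost:9000")
monitor.protocol = MonitorClient
connectWS(monitor)

for _ in xrange(500):                      # --num_senders in the real client
    sender = WampClientFactory("ws://localhost:9000")
    sender.protocol = SenderClient
    connectWS(sender)

reactor.run()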

I’d appreciate feedback about this implementation. In particular, there are a few issues I’ve encountered:

Connection failures

The default configuration tries to connect 500 sender clients and have them publish 100 messages each. This usually succeeds (all messages get delivered and received by the monitor), but sometimes the clients encounter connection errors.

Here’s an example of running 2 stress test clients, each publishing on separate topics:

terminal1$ python pubsub/server.py

2013-10-29 02:30:08-0700 [-] Log opened.

2013-10-29 02:30:08-0700 [-] WampServerFactory starting on 9000

2013-10-29 02:30:08-0700 [-] Starting factory <autobahn.wamp.WampServerFactory instance at 0x97c916c>

2013-10-29 02:30:08-0700 [-] Site starting on 8080

2013-10-29 02:30:08-0700 [-] Starting factory <twisted.web.server.Site instance at 0x97c98cc>

terminal2$ python pubsub/stressclient.py -i 1000

terminal3$ python pubsub/stressclient.py -t http://autobahn-pubsub/channels/2/stress -i 1000 -r 60

BATCH COMPLETED

0:00:09.006647

STATE: DONE

MESSAGE DELIVERY STATISTICS

 Messages expected: 50000

 Messages expected (adjusted for failures): 46600

 Messages received: 46600 (93.2%)

CLIENT CONNECTION STATISTICS

 Attempted clients: 500

 Clients which missed some messages: 0 (0.0%)

 Clients which sent all messages: 466 (93.2%)

 Clients which experienced connection failures: 34 (6.8%)

	 Connections lost: 34

	 Connections failed: 0

ERROR INFORMATION:

[34] [Failure instance: Traceback (failure with no frames): <class 'twisted.internet.error.ConnectionAborted'>: Connection was aborted locally, using.

]

This output means that 34 of the 500 clients ended up in their clientConnectionLost callback and couldn't publish any data. I can't reproduce this every time, but when it does happen the clients usually get that ConnectionAborted error. Under heavier load I've also seen this error:

[Failure instance: Traceback (failure with no frames): <class 'twisted.internet.error.TimeoutError'>: User timeout caused connection failure.

]

although bumping the websocket connect timeout (--websocket_timeout) up to something like 60 usually prevents that error. The above logs come from an Ubuntu VM (12.04.3 LTS with wsaccel installed) running on an OS X host. You may need to increase the --num_senders and --num_messages values to see this effect on your machine.
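
For context, the intent of that flag is just to raise the connect timeout on the client connections, along these lines (a sketch; connectWS takes a timeout argument the same way reactor.connectTCP does, if I'm reading it right):

from autobahn.websocket import connectWS
from autobahn.wamp import WampClientFactory, WampClientProtocol

factory = WampClientFactory("ws://localhost:9000")
factory.protocol = WampClientProtocol
# give the TCP/WebSocket connect up to 60 seconds before it counts as failed
connectWS(factory, timeout=60)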

When I try to run this same setup on OS X (without wsaccel), errors usually happen with just a single stress client. I also see a new kind of error (ConnectError):

$ python stressclient.py

BATCH COMPLETED

0:00:11.009030

STATE: DONE

MESSAGE DELIVERY STATISTICS

 Messages expected: 50000

 Messages expected (adjusted for failures): 37500

 Messages received: 37500 (75.0%)

CLIENT CONNECTION STATISTICS

 Attempted clients: 500

 Clients which missed some messages: 0 (0.0%)

 Clients which sent all messages: 375 (75.0%)

 Clients which experienced connection failures: 125 (25.0%)

	 Connections lost: 50

	 Connections failed: 75

ERROR INFORMATION:

[50] [Failure instance: Traceback (failure with no frames): <class 'twisted.internet.error.ConnectionAborted'>: Connection was aborted locally, using.

]

[75] [Failure instance: Traceback (failure with no frames): <class 'twisted.internet.error.ConnectError'>: An error occurred while connecting: 54: Connection reset by peer.

]

Note that in both of these cases, the server does not complain about too many open files. I'm not seeing "Could not accept new connection (EMFILE)" in either case (I can produce that error if I run a bunch of stress clients simultaneously, as I would expect without any additional tuning).

Do you have any idea what could be causing these errors? The stress client does make these client connections all at once, but I would expect Twisted (and Autobahn) to handle an influx of ~1000 connections at a time. That doesn’t seem like a crazy intense workload to me.

Optimizations

When I try to increase the number of messages sent by each client, I notice that the stress client sometimes hangs for a few seconds. For example:

$ python pubsub/stressclient.py -m 800

BATCH STARTED

0:00:00.000107

 Messages expected (adjusted for failures): 400000

 Messages received: 0 (0.0%)

 Clients which sent all messages: 0 (0.0%)

 Clients which experienced connection failures: 0 (0.0%)

0:00:05.335226

 Messages expected (adjusted for failures): 400000

 Messages received: 0 (0.0%)

 Clients which sent all messages: 0 (0.0%)

 Clients which experienced connection failures: 0 (0.0%)

0:00:06.003967

 Messages expected (adjusted for failures): 400000

 Messages received: 6802 (1.7005%)

 Clients which sent all messages: 1 (0.2%)

 Clients which experienced connection failures: 0 (0.0%)

Notice that the first two updates are 5 seconds apart (they're supposed to be 1 second apart, as defined by the LoopingCall in init_batch). I think this happens because of how long it takes each client to serialize and publish all 800 messages, which ties up the Twisted reactor. Is there a way to optimize this process for testing purposes? Perhaps by accessing the low-level websocket and publishing a pre-serialized payload to save the clients processing time?
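
To make the suspicion concrete, here is a simplified sketch (not the actual repo code) of the pattern: each sender publishes its whole batch in a single reactor tick, so the one-second stats LoopingCall can't fire on time while 500 clients each serialize 800 publishes:

from twisted.internet import task
from autobahn.wamp import WampClientProtocol

TOPIC = "http://autobahn-pubsub/channels/1/stress"  # placeholder

class SenderClient(WampClientProtocol):
    def onSessionOpen(self):
        # All 800 publish() calls (each JSON-serializing its payload) run
        # in one reactor iteration before control returns to the event loop.
        for seq in xrange(800):
            self.publish(TOPIC, {"seq": seq})

def print_statistics():
    pass  # placeholder for the real stats output in init_batch

# supposed to fire every second, but gets starved while the senders publish
stats_loop = task.LoopingCall(print_statistics)
stats_loop.start(1.0)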

Either way, I hope this utility is helpful, and I'd really appreciate any insights or feedback.

Nick


#2

Hi Nick,

thanks for sharing! I'll try it myself, but right now I'm a bit under time pressure .. however, see my comments below.

*Connection failures*
The default configuration tries to connect 500 sender clients and have
them publish 100 messages each. This usually succeeds (all messages get
delivered and received by the monitor), but sometimes the clients
encounter connection errors.

For real load/performance testing, there will be multiple things to consider:

1) Run on a _capable OS_ with a sane networking stack and a capable _Twisted reactor_:

Linux/epoll
FreeBSD/kqueue

Neither OS X nor Windows falls into that category. I am not surprised that OS X connects aren't robust under pressure: Apple's kqueue implementation is not very good, and OS X only has an API that looks like BSD on top of a totally different kernel.
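
For example, on Linux, selecting the epoll reactor is a two-liner that has to run before anything else imports the reactor:

# must run before any "from twisted.internet import reactor"
from twisted.internet import epollreactor
epollreactor.install()

from twisted.internet import reactor
print(reactor)  # twisted.internet.epollreactor.EPollReactor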

2) You need to do OS-level tuning of the TCP/IP stack: accept queues etc.

3) You also need to tune Twisted's accept queue depth (the backlog):

https://github.com/crossbario/crossbar/blob/master/crossbar/crossbar/netservice/hubwebsocket.py#L627
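
listenWS ends up calling reactor.listenTCP under the hood, so the accept queue depth can be raised roughly like this (the number is illustrative, and the kernel caps it at net.core.somaxconn anyway):

from twisted.internet import reactor
from autobahn.wamp import WampServerFactory, WampServerProtocol

factory = WampServerFactory("ws://localhost:9000")
factory.protocol = WampServerProtocol
# backlog sets the TCP accept queue depth; Twisted's default is 50
reactor.listenTCP(9000, factory, backlog=1024)
reactor.run()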

4) You will want to run

wsaccel _and_ ujson

so that both WS and JSON processing happen in native code.

5) Even with all of the above done, if you do not throttle the load client, it might overwhelm the acceptor .. have a look at:

https://github.com/tavendo/AutobahnTestSuite/blob/master/autobahntestsuite/autobahntestsuite/massconnect.py
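
The idea (roughly, this isn't a copy of massconnect.py) is to open the client connections in small batches with a short pause in between, instead of all at once:

from twisted.internet import reactor
from autobahn.websocket import connectWS
from autobahn.wamp import WampClientFactory, WampClientProtocol

URL = "ws://localhost:9000"
TOTAL_CLIENTS = 500   # illustrative numbers
BATCH_SIZE = 25
BATCH_DELAY = 0.1     # seconds between batches

def connect_batch(remaining):
    """Open BATCH_SIZE connections, then reschedule for the rest."""
    for _ in xrange(min(BATCH_SIZE, remaining)):
        factory = WampClientFactory(URL)
        factory.protocol = WampClientProtocol
        connectWS(factory)
    remaining -= BATCH_SIZE
    if remaining > 0:
        reactor.callLater(BATCH_DELAY, connect_batch, remaining)

reactor.callWhenRunning(connect_batch, TOTAL_CLIENTS)
reactor.run()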

6) Autobahn already serializes an event only once .. and pushes the buffered octets to each receiver:

https://github.com/tavendo/AutobahnPython/blob/master/autobahn/autobahn/wamp.py#L1045
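
If you want to experiment with the same trick on the sending side (what Nick called publishing a pre-serialized payload), a rough, untested sketch against the WAMP v1 wire format could look like the following; every event then carries the identical payload, which is usually fine for a stress test:

import json
from autobahn.wamp import WampClientProtocol, WampProtocol

TOPIC = "http://autobahn-pubsub/channels/1/stress"  # placeholder

class PreparedSender(WampClientProtocol):
    """Serialize one WAMP PUBLISH frame once and re-send the same octets."""
    def onSessionOpen(self):
        # WAMP v1 publish frame: [MESSAGE_TYPEID_PUBLISH, topicURI, event]
        raw = json.dumps([WampProtocol.MESSAGE_TYPEID_PUBLISH, TOPIC, {"seq": 0}])
        prepared = self.factory.prepareMessage(raw)
        for _ in xrange(800):
            self.sendPreparedMessage(prepared)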

7) Be wary when doing load/performance tests with VMs .. the networking of the hypervisor and the host OS then comes into play as an additional variable.

If you have a chance, could you redo your testing with some/all of the above taken into consideration?

That would be awesome!

Cheers,
Tobias
