Performance and Profiling

#1

Hi,

Just started evaluating Autobahn, which seems very nice (I already know and like Twisted).

Ran into a strange issue while writing a custom WebSocket client (for iOS).

Depending on what we send in the handshake, there is a big difference in server performance.

Basically with

Sec-WebSocket-Version: 8

vs.

Sec-WebSocket-Version: 13

The CPU usage for twistd/Autobahn is ~40% higher when using 13; all other code is identical, the only difference is what's being sent in the handshake.

Sending this @ 50fps and just echoing back to the client:

RX Frame from 10.0.0.1:60419 : fin = True, rsv = 0, opcode = 1, mask = e81b96d3, length = 18, payload = {"action":"test"}

Haven't done any profiling yet, but until then, two questions:

  • In websocket.py, what would the code difference be between these two scenarios?

  • Any guidance on how to profile this properly? I ran some quick tests with cProfile but can't get anything useful out of it (I've never actually used cProfile before).

In general it feels a bit slow: with one connection sending 50 packets like the above per second, Twisted eats ~7% CPU just handling this on my MacBook Air with a 1.8 GHz Intel Core i5.
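
For context, the server side is essentially a minimal echo server along these lines (a sketch from memory against the autobahn.websocket API I'm using; the listen URL is a placeholder):

    from twisted.internet import reactor
    from autobahn.websocket import (WebSocketServerFactory,
                                    WebSocketServerProtocol, listenWS)

    class EchoProtocol(WebSocketServerProtocol):
        def onMessage(self, payload, binary):
            # echo every received frame straight back to the sender
            self.sendMessage(payload, binary)

    if __name__ == '__main__':
        factory = WebSocketServerFactory("ws://0.0.0.0:9000")  # placeholder URL
        factory.protocol = EchoProtocol
        listenWS(factory)
        reactor.run()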

Again, I've just started exploring Autobahn; I've built a lot of TCP socket servers under Twisted before, but never ran into performance hits this early on.

best

D


#2

Hi,

Just started evaluating Autobahn which seems to be very nice (as I like
Twisted from before).
Ran into a strange issue when writing a custom websocket client (for iOS).

Any reason you are not using SocketRocket?

https://github.com/square/SocketRocket

They use AutobahnTestSuite, and protocol conformance is very good.

Depending on what we send in the handshake there is a big difference in
performance in the server.

Basically with
Sec-WebSocket-Version: 8
vs.
Sec-WebSocket-Version: 13

The CPU usage for twistd/autobahn is ~40% up if using 13, all other code
is identical, only difference is what's being sent in the handshake.

Sending this @ 50fps and just echoing back to the client:
RX Frame from 10.0.0.1:60419 : fin = True, rsv = 0, opcode = 1, mask =
e81b96d3, length = 18, payload = {"action":"test"}

Haven't done any profiling yet but until then, two questions:
  - In websocket.py what would be the code difference in these two
scenarios?

There is only 1 place where code paths differ between 8 and 13:

https://github.com/tavendo/AutobahnPython/blob/master/autobahn/autobahn/websocket.py#L2743

And this only concerns the name of the HTTP header that carries the WS origin. Obviously, this has no performance impact, so I have no explanation for your measurements.

  - Any guidance on how to profile this properly? Just ran some quick tests with cProfile but can't get anything useful out of it (never used cProfile actually).

In general I feel it's a bit slow as with one connection, sending 50 packets like above per second, on my MacBook Air with 1.8 GHz Intel Core i5, Twisted eats ~7% CPU just handling this.

Again, I just started exploring Autobahn and have been doing a lot of TCP socket servers under Twisted but never ran into these performance hits this early on before.

If you want to do performance measurements and compare to other WS servers, please

1) Make sure you use the best Twisted reactor available for your platform. On OSX, that would be the kqueue-based reactor (see the sketch after point 3 below).

2) Either run Autobahn under PyPy

https://bitbucket.org/pypy/pypy/downloads/pypy-2.0-beta2-osx64.tar.bz2

or use wsaccel. See:

https://github.com/tavendo/AutobahnPython#performance

With the latest Autobahn from GitHub, wsaccel is used automatically if available.

With the latest Autobahn released on PyPI (0.5.14), you need to activate wsaccel as described here:

https://github.com/methane/wsaccel

3) The main hotspots of Autobahn are UTF8 validation and WS frame masking, which is why the stuff from 2) helps.
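
Regarding 1): the reactor has to be installed before twisted.internet.reactor is imported anywhere else. A minimal sketch (the platform check is only illustrative):

    import sys

    # install the best reactor for the platform *before* anything imports
    # twisted.internet.reactor
    if sys.platform.startswith('darwin'):
        from twisted.internet import kqreactor
        kqreactor.install()
    elif sys.platform.startswith('linux'):
        from twisted.internet import epollreactor
        epollreactor.install()

    from twisted.internet import reactor   # now resolves to the installed reactor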

Hope this helps,

Tobias


#3

Hi Tobias,

Regarding SocketRocket: we rolled our own since we wanted portability to Android.

If you have any tips on stable C/C++ libs let me know.

About Sec-WebSocket-Version 8 vs. 13, I will do some more investigation, as it didn't make sense. So for the WebSocket transport layer, the only actual code difference is between processDataHixie76 and processDataHybi, right?

I've only done initial quick-and-dirty performance tests on the Mac; our main target is OpenWRT Linux, where Twisted runs under select().

Let me know if you want me to do some tests and share the profiling results.

I’ve considered using PyPy but before that I’d like to understand why sending data @ 50fps will eat about 50% CPU on my 720MHz target.

best

David


#4

Hi Tobias,

Regarding SocketRocket we rolled our own since we wanted the portability
to Android.
If you have any tips on stable C/C++ libs let me know.

Here are my suggestions

C++/ASIO: https://github.com/zaphoyd/websocketpp
native Android: https://github.com/tavendo/AutobahnAndroid

Both are tested with AutobahnTestSuite.

WebSocket++ is the fastest WS OSS implementation I am aware of, and its protocol compliance is 100% (we have a long and fruitful collaboration between the Autobahn and WebSocket++ projects).

On Windows you might want to look at

http://www.serverframework.com/

It's not OSS, but very mature, and the author is a serious networking expert. Also tested with AutobahnTestSuite.

About WebSocket-Version: 8 vs. 13, I will do some more investigation as
it didn't make sense. So for the websocket transport layer the only
actual code-difference is between processDataHixie76 vs. processDataHybi
right?

Yep. Hybi and Hixie differ a lot, so this is where code paths will differ throughout websocket.py.

I've only done initial quick-n-dirty performance tests on Mac but mainly
on our target which is OpenWRT Linux, so Twisted runs under select().

Don't know a lot about this (we are only DD-WRT _users_), but why doesn't OpenWRT support epoll?

Let me know if you want me to do some tests and share the profiling results.

This is definitely interesting! Please don't hesitate to post anything here ..

I've considered using PyPy but before that I'd like to understand why
sending data @ 50fps will eat about 50% CPU on my 720MHz target.

There is something weird going on .. 50% for small WS messages at 50 fps on a 720 MHz ARM seems much too high.

Are you running WSS?

I'd suggest trying the following:

- don't use WSS, but WS
- use binary WS messages
- disable UTF8 validation (setProtocolOptions.utf8validateIncoming = False)
- disable frame masking (setProtocolOptions.maskClientFrames = False and setProtocolOptions.requireMaskedClientFrames = False)

Essentially, this should remove all CPU-intensive code paths. Seeing what the CPU load is with the above would probably give further insights without complex profiling ..
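
A sketch of what that looks like, assuming the masking option belongs on the client factory and the other two on the server factory (check the setProtocolOptions docstrings for the exact keyword names):

    from autobahn.websocket import WebSocketServerFactory, WebSocketClientFactory

    # server side: skip UTF8 validation and accept unmasked client frames
    server_factory = WebSocketServerFactory("ws://0.0.0.0:9000")   # plain ws://, not wss://
    server_factory.setProtocolOptions(utf8validateIncoming=False,
                                      requireMaskedClientFrames=False)

    # client side: don't mask outgoing frames at all
    client_factory = WebSocketClientFactory("ws://192.168.1.10:9000")  # placeholder URL
    client_factory.setProtocolOptions(maskClientFrames=False)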

Tobias


#5

Hi Tobias,

Regarding SocketRocket we rolled our own since we wanted the portability

to Android.

If you have any tips on stable C/C++ libs let me know.

Here are my suggestions

C++/ASIO: https://github.com/zaphoyd/websocketpp

native Android: https://github.com/tavendo/AutobahnAndroid

Both are tested with AutobahnTestSuite.

WebSocket++ is the fastest WS OSS implementation I am aware of, and its
protocol compliance is 100% (we have a long and fruitful collaboration
between the Autobahn and WebSocket++ projects).

Thanks, sounds good, will evaluate it.

About WebSocket-Version: 8 vs. 13, I will do some more investigation as

it didn’t make sense. So for the websocket transport layer the only

actual code-difference is between processDataHixie76 vs. processDataHybi

right?

Yep. Hybi and Hixie differ a lot, so this is where code paths will
differ throughout websocket.py

But for the 8 vs. 13, there is still no difference on the receiver side in Autobahn, right?

Performance-wise, have you done any benchmarks comparing the two? I want to get a feeling for the worst-case scenario with some old or new browser running protocol rev x.

I’ve only done initial quick-n-dirty performance tests on Mac but mainly

on our target which is OpenWRT Linux, so Twisted runs under select().

Don't know a lot about this (we are only DD-WRT users), but why doesn't
OpenWRT support epoll?

I haven't tried to compare these yet; so far we've just used the default Twisted reactor (select).

So far the performance issues I've seen have been for one connection only. Moving to epoll could, I think, improve things somewhat for opening new fds/connections, but you are right, it's definitely worth comparing.

Let me know if you want me to do some tests and share the profiling results.

This is definitely interesting! Please don't hesitate to post anything here …

Will do. I just need to figure out a good way to profile without adding overhead. To be honest, on previous pure TCP/IP socket projects I've never run into performance issues, so optimization could usually be done just by timing the actual Python code, with no need for profilers. But here we're talking about just Twisted and an echo server in Autobahn, so let me know if you can give me any directions on the best way to profile properly. With cProfile I don't seem to get any useful info as of now.

I’ve considered using PyPy but before that I’d like to understand why

sending data @ 50fps will eat about 50% CPU on my 720MHz target.

There is something weird going on … 50% for small WS messages at 50 fps
on a 720 MHz ARM seems much too high.

Are you running WSS?

Nope, just using a basic WebSocketClientFactory('ws://IP…') and calling sendMessage every 0.02 s from one machine to the target server.
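
Roughly like this, i.e. a LoopingCall firing every 20 ms (just a sketch; the real URL and payload differ):

    import json
    from twisted.internet import reactor
    from twisted.internet.task import LoopingCall
    from autobahn.websocket import (WebSocketClientFactory,
                                    WebSocketClientProtocol, connectWS)

    class PushProtocol(WebSocketClientProtocol):
        def onOpen(self):
            # push the small JSON payload 50 times per second
            LoopingCall(self.push).start(0.02)

        def push(self):
            self.sendMessage(json.dumps({"action": "test"}))

    factory = WebSocketClientFactory("ws://10.0.0.2:9000")  # placeholder address
    factory.protocol = PushProtocol
    connectWS(factory)
    reactor.run()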

I’d suggest trying the following:

  • don’t use WSS, but WS

  • use binary WS messages

Will do, but some browsers may still send non-binary (text) messages, so we still need to support the worst case.

  • disable UTF8 validation (setProtocolOptions.utf8validateIncoming = False)

  • disable frame masking (setProtocolOptions.maskClientFrames = False and
    setProtocolOptions.requireMaskedClientFrames = False)

ok, will try this as well.

Essentially, this should remove all CPU-intensive code paths. Seeing what the
CPU load is with the above would probably give further insights without complex
profiling …

I've attached a quick python -m cProfile -> gprof2dot.py sample from just running a WebSocketServerFactory -> WebSocketServerProtocol echoing whatever (non-binary) data comes in back to the client. As you can see, most of the workload happens in processDataHybi.
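
For reference, this is roughly how I produced it: I wrapped reactor.run() in cProfile inside the server script and fed the stats file to gprof2dot (invocation from memory):

    import cProfile, pstats

    # instead of calling reactor.run() directly, run it under the profiler;
    # the stats file is written when the reactor stops (Ctrl-C)
    cProfile.run('reactor.run()', 'echo.pstats')

    # quick look at the hotspots, sorted by cumulative time
    pstats.Stats('echo.pstats').sort_stats('cumulative').print_stats(20)

    # call graph rendering (needs gprof2dot and graphviz):
    #   gprof2dot.py -f pstats echo.pstats | dot -Tpng -o profile.png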


#6

Hi David,

But for the 8 vs. 13, there is still no difference on the receiver side in Autobahn, right?
Performance-wise, have you done any benchmarks to compare the two?

There should be no difference between 8 and 13 performance-wise whatsoever. I haven't done specific benchmarking though.

    I'd suggest trying the following:

    - don't use WSS, but WS
    - use binary WS messages

will do. but some browsers may still run non-binary protocols so we still
need to support the worst case.

    - disable UTF8 validation (setProtocolOptions.utf8validateIncoming =
    False)
    - disable frame masking (setProtocolOptions.maskClientFrames = False
    and
    setProtocolOptions.requireMaskedClientFrames = False)

ok, will try this as well.

Could you please rerun your profiling / CPU load measurements with the above?

That is: epoll, no WSS, UTF8 validation disabled, frame masking disabled.

The results should allow us to nail down the origin of the high CPU load ..

Sorry, I currently have no time to replicate what you are doing here. I'd love to, but I am just drowning in stuff ..

Tobias


#7

Hi David,

attached is a screenshot showing a quick performance load test running Autobahn on the RaspberryPi on PyPy 2.2.1:

PubSub dispatch rate is 1000 complex events / sec.

Average Roundtrip-time (RTT) is ca. 10ms over locally switched Ethernet.

CPU load is roughly 65% with roughly half of the latter being kernel.

I think that's not bad, given the Pi is an ARM11@700MHz, which is likely less beefy than your Cortex-A8 and _nothing_ at all compared to any decent Xeon thingy (those Xeons are orders of magnitude more powerful).

The test runs 20 PubSub clients connected to a WAMP server running on the Pi and 1 publisher publishing at 50 Hz a structured event that contains a string field of length 10 .. plus more fields like timestamp, IDs etc.

The code is here:

https://github.com/tavendo/AutobahnPython/tree/master/examples/wamp/pubsub/loadlatency

Anyway, if you think that's not fast enough or too much CPU: you likely won't be able to cut the kernel time anyway, whether in C or C++ or whatever. It might be that the TCP/IP stack can be tuned on the Pi to lower the latencies and CPU load, but I think it's not bad already:

A message broker on the Pi that can do 1000 events/sec with structured payload;)

Now, as said: further speedups (of the userland CPU) can be done: we have a network accelerator in the pipeline:

Uberschall

which does vectorized processing on SSE2 etc. and could be extended to NEON. It won't cut the kernel time, but it will cut user time.

Cheers,
/Tobias

···

Am 01.12.2013 13:40, schrieb David Eriksson:

Hi Tobias,

This sounds great. We're running OpenWrt with Linux 3.2 (but will soon have finished migrating to 3.12).

The OpenWrt package manager builds Python 2.7.3 for us via GCC.
https://dev.openwrt.org/browser/packages/lang/python/Makefile

Right now we're building with toolchain-arm_v7-a_gcc-4.6-linaro_eglibc-2.15_eabi
and these options:
CFLAGS="-Os -pipe -march=armv7-a -mtune=cortex-a8 -fno-caller-saves -mfpu=neon -mfloat-abi=hard -fhonour-copts -Wno-error=unused-but-set-variable -fpic -DSQLITE_ENABLE_UNLOCK_NOTIFY "

I have it on my list to experiment with PyPy and also wsaccel. I just need to set up some isolated test runners to be able to compare the results. But my plan is to first figure out exactly what's causing the high CPU load.

As I mentioned before, on our 600 MHz target with a simple WS echo server, sending ~10 bytes at around 25 Hz to 1 client eats unreasonable amounts of CPU. When adding a few more subscribers to the same data, it sometimes feels like a queue builds up somewhere, adding latency to the mix.

Let me know if you have some ideas on how to profile and get some stats out to nail down the first couple of bottlenecks.

Best David

On 28 Nov 2013, at 15:34, Tobias Oberstein <tobias.o...@gmail.com> wrote:

Hi David,

Am 27.11.2013 14:52, schrieb David Eriksson:

Hi Tobias,

Half a year later we're back on this. We just finished setting up some test runners to simulate various incoming traffic scenarios through a simple Twisted/Autobahn relay server to do some isolated profiling.

Our target is an ARM platform (AM335x), so basically no floats except via the NEON co-processor if the correct build flags are on.

I saw your new project Crossbar and figured maybe you've been digging deeper into getting Autobahn to run on embedded platforms such as the Raspberry Pi, BeagleBone, etc.

We are indeed using Autobahn and Crossbar on the Pi (and devices of similar capability). Lately, I've been looking into running on even less capable devices (Arduino Yun) .. see that other currently running thread / conversation with Peter. That is a 32 bit MIPS CPU with less steam and RAM than the Pi.

If ARM soft-float, standard hardware float or Neon float makes a difference to Autobahn, I haven't analyzed. I tend to think that would be surprising. But I don't know.

So the AM335x is a 32 Bit ARM with MMU and Neon, but without "standard" hardware float? What compile flags do you use? GCC? Which version?

I.e. on x86-64, SSE-based float is actually the default for how GCC and Clang compile nowadays. Using the "old" x87 unit is actually discouraged unless you need the 80-bit FP precision (and not the IEEE-standard 64 bit that SSE float provides).

What OS are you running? Do you build your CPython yourself? How? Can you build PyPy?

Sorry, lots of questions;)

I'd really like to get this Python setup tight and am confident it's possible to send 10 bytes to 30 clients at 50 Hz without a problem. But something is eating all our CPU and outgoing traffic seems to be queued somewhere.

I'm actually a hardware guy, but I can hear my colleagues voting to rewrite the WebSocket stuff in C++. I'm not so sure :) Let me know if you have some ideas how to make this blazing fast.

Well, I'd take on the challenge of beating C++ if you put a trophy on the table ;) I'd take it since I lately did extensive profiling and experiments of multi-core Autobahn against Netty on Java HotSpot and WebSocket++/ASIO/Boost. I am confident we can compete with / surpass them. Let me just throw in a couple of links and info bits (the whole story is much longer, and I am still preparing a blog post with all the hard numbers etc.):

We are now using PyPy (or CPy+wsaccel) to run Autobahn-based stuff.

PyPy is a tracing JIT with an awesome GC that beats Java Hotspot in my measurements of latency (max, 95% quantiles etc). It produces machine code close to native C/C++.

Btw: PyPy is now on the Pi also: http://www.raspberrypi.org/archives/3881

For CPython, we have wsaccel, a native C code accelerator which is open-source.

Autobahn on PyPy and CPy+wsaccel is very fast.

I still wasn't satisfied (huh, only somewhat faster than Java on HotSpot, a VM with hundreds of man-years of effort put into it? Not good enough).

So I sat down and wrote an accelerator (for hot code paths in WS) in SSE2/SSSE3/AVX assembly (actually C SIMD intrinsics):

"Uberschall - A networking accelerator library"

[unreleased, proprietary]

This is hand-optimized, vectorized code that runs on 256/512-bit registers. It can process raw WebSocket at multiple GB/s, on a 4-year-old Nehalem Xeon. I don't have the physical hardware, but I guess you could saturate _multiple_ 10GbE NICs on current Xeons. It's faster than anything I ever measured in WebSocket land. Forget _pure_ C/C++ - you won't have a chance against SIMD vectorized assembly ;) And that's actually the perfect thing (in my view): going bare-metal for the really hot loops, and having a high-level, secure, dynamic language like Python for the rest. And remember: on PyPy, even the latter is also (often) JITted to native (scalar) machine code.

Anyway, the above benchmarking was all done on Xeon .. not low-power ARM. My issue of "work queue full" still persists till today ;)

/Tobias


#8

Hi Tobias,

This is great. Thanks for pushing the test-code. We’ll do some tests here and let you all know our findings.

In the end, once performance is tuned, I want to add dynamic throttling of "real-time" data (kind of like a resampler).
Maybe there is something like this already somewhere deep inside Twisted!?

So, let's say we're pushing visualisation data from our ARM to the subscribed clients. In a home environment the most common use case might be 1 client at a time, so we could run the message publisher at 100 Hz, but dynamically scale down the publisher frequency as more clients are added or CPU usage is affected, something along the lines of the sketch below.
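
Something like this is what I have in mind (a pure sketch, not tied to the Autobahn API; send_event stands in for whatever actually pushes a message to one client):

    from twisted.internet.task import LoopingCall

    class AdaptivePublisher:
        """Publish at up to max_hz, scaling the rate down as clients subscribe."""

        def __init__(self, send_event, max_hz=100.0):
            self.send_event = send_event
            self.max_hz = max_hz
            self.clients = set()
            self.loop = LoopingCall(self.tick)
            self.loop.start(1.0 / max_hz)

        def add_client(self, client):
            self.clients.add(client)
            self.rescale()

        def rescale(self):
            # publish frequency falls off inversely with the number of clients
            hz = self.max_hz / max(1, len(self.clients))
            if self.loop.running:
                self.loop.stop()
            self.loop.start(1.0 / hz)

        def tick(self):
            for client in self.clients:
                self.send_event(client)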

Anyway, many thanks Tobias for digging into this. I'll start here without PyPy and let you know the results over time.
Btw, I just realised we're running a Twisted Core 2.5 package from Jan 2007 on our ARM. Not sure what has happened in terms of tuning since then, but I'm going to bump it a bit.

What versions of Twisted and Autobahn did you run during your tests?

best David


#9

Hi Tobias,

This is great. Thanks for pushing the test-code. We'll do some tests here and let you all know our findings.

Yes, please post!

In the meantime, I did more experiments playing around with various knobs. These are the maximums I get:

6000 events/s dispatched
30Mb/s net payload pushed

This is then at 100% CPU load on the Pi.

I think this is pretty good. And I don't expect that you could do significantly better in C or anything - without taking a radically different approach (by radical, I mean bypassing the TCP/IP stack in the kernel altogether).

In the end once performance is tuned, I want to make a dynamic throttling of "real-time" data (kind of like a resampler).

Throttling is a term which I'd understand as adapting to a slow receiver. This isn't your scenario I guess, and it would lead to buffering anyway. You could do that with Twisted and Autobahn (see the Producer/Consumer examples). But with PubSub it gets complex: if you have N subscribers and 1 publisher, you would want to exert backpressure on the publisher whenever at least 1 subscriber can't keep up.

"Resampling" means the broker has to decide how events would be "merged" together (only dispatch every Nth event published, average the values inside the events etc). This path leads to putting application logic into the broker ..

Another approach is coalescing events in the broker (at the cost of event latency): instead of immediately dispatching every event published, buffer the events, and then send out buffered events in batches to clients. This will reduce the syscall rate (socket sendto()) and will likely help, since I have a feeling that the syscall rate on the Pi is the limiting factor.
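
A sketch of that pattern (not Autobahn API, just the idea): buffer published events and flush them to each client a few times per second, so the number of socket writes no longer grows with the event rate:

    import json
    from twisted.internet.task import LoopingCall

    class CoalescingDispatcher:
        """Buffer published events and dispatch them in batches."""

        def __init__(self, clients, flush_interval=0.1):
            self.clients = clients      # set of connected WebSocket protocols
            self.pending = []           # events published since the last flush
            LoopingCall(self.flush).start(flush_interval)

        def publish(self, event):
            # no socket write here - just remember the event
            self.pending.append(event)

        def flush(self):
            if not self.pending:
                return
            batch = json.dumps(self.pending)   # one message carrying many events
            self.pending = []
            for client in self.clients:
                client.sendMessage(batch)      # one send per client per interval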

Maybe there are things like this already somewhere deep inside twisted!?

So, let's say we're pushing visualisation data from our ARM to the subscribed clients. So in a home environment, most common use-case might be 1 client at a time so then we could run the msg publisher at 100hz, but dynamically scale down the publisher frequency as more clients are added or cpu usage is affected.

Anyway, many thanks Tobias for digging into this. I'll start here without pypy and let you know the results over time.
Btw, just realised we're running a TwistedCore 2.5 package from Jan 2007 on our ARM. Not sure what has happened in terms of tuning since then, but I'm going to bump a bit.

What versions of Twisted and Autobahn did you run during your tests?

I run Twisted 13.2 and Autobahn 0.6.5. You should really use the latest code, there is no need to use distro packages. Dump those.

Btw: If you try to install Autobahn via easy_install, anything <Twisted 11 won't work anyway. It's in the Python package dependency spec of Autobahn.

Cheers,
/Tobias


#10

Hi,

In the meantime, I did more experiments playing around with various knobs. These are the maximums I get:

6000 events/s dispatched
30Mb/s net payload pushed

This is then at 100% CPU load on the Pi.

Wow! Interesting. Did you try the same without PyPy?

I think this is pretty good. And I don't expect that you could do significantly better in C or anything - without taking a radically different approach (by radical, I mean bypassing the TCP/IP stack in the kernel altogether).

Agree

Throttling is a term which I'd understand as adapting to a slow receiver. This isn't your scenario I guess, and it would lead to buffering anyway. You could do that with Twisted and Autobahn (see the Producer/Consumer examples). But with PubSub it gets complex: if you have N subscribers and 1 publisher, you would want to exert backpressure on the publisher whenever at least 1 subscriber can't keep up.

I think here we will most likely drop a connection if one client is not behaving properly.

"Resampling" means the broker has to decide how events would be "merged" together (only dispatch every Nth event published, average the values inside the events etc). This path leads to putting application logic into the broker ..

For visuals this is fine, say plotting a graph. Adding together 5-10 packets and interpolating a bit on the receiver often looks much better than one by one. Small packets will even get lumped together by buffers on the network interfaces.

Another approach is coalescing events in the broker (at the cost of event latency): instead of immediately dispatching any event published, buffer the events, and then send out buffered events in batches to clients. This will reduce the syscall rate (socket sendto()). This will likely help, since I have that feeling that syscall rate on the Pi is limiting.

Yes.

I run Twisted 13.2 and Autobahn 0.6.5. You should really use the latest code, there is no need to use distro packages. Dump those.

Are there any C speedups in the Twisted tree I should be aware of? (In my case I don't have a toolchain on the target.)

best
David


#11

In the meantime, I did more experiments playing around with various knobs. These are the maximums I get:

6000 events/s dispatched
30Mb/s net payload pushed

This is then at 100% CPU load on the Pi.

Wow! Interesting. Did you try the same without PyPy?

No, I usually don't bother with CPython these days .. PyPy just has a lot more steam. I'll write a blog post about my setup and findings, so you can try it and assure yourself.

/Tobias


#12

"Resampling" means the broker has to decide how events would be "merged" together (only dispatch every Nth event published, average the values inside the events etc). This path leads to putting application logic into the broker ..

For visuals this is fine, say plotting a graph. Adding together 5-10 packets and interpolating a bit on the receiver often looks much better than one by one. Small packets will even get lumped together by buffers on the network interfaces.

Regarding "small packets" being lumped together in the TCP/IP stack: that's right, but that doesn't help with the syscall rate ..

When I do socket.send('*') often enough, the OS will fall over despite all those '*' being coalesced inside the kernel TCP before going onto the wire.

What I meant is: coalescing those many '*' in userland and doing fewer syscalls into kernel TCP (socket.send).

You might also play with turning off the "no delay" option (TCP_NODELAY, i.e. Nagle disabled) in Autobahn, which is on by default.
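
If I remember correctly, the knob is the tcpNoDelay protocol option (check setProtocolOptions if the name differs in your version):

    from autobahn.websocket import WebSocketServerFactory

    factory = WebSocketServerFactory("ws://0.0.0.0:9000")
    # Autobahn sets TCP_NODELAY (Nagle off) by default; re-enabling Nagle lets
    # the kernel coalesce small writes, at the cost of some added latency
    factory.setProtocolOptions(tcpNoDelay=False)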

I run Twisted 13.2 and Autobahn 0.6.5. You should really use the latest code, there is no need to use distro packages. Dump those.

Are there any C-speedups in the twisted tree I should be aware of (in my case I don't have a tool-chain on the target).

No. Not in Twisted. There are some binary extension modules in Twisted, but those are generally not needed and are not performance related anyway.

With Autobahn on CPython, you should use wsaccel and ujson (C native libs).
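
ujson is a drop-in replacement for the stdlib json module for the usual dumps/loads calls, e.g. for the small JSON payloads you are pushing:

    import ujson

    # serialize / parse the hot-path payloads with the C implementation
    payload = ujson.dumps({"action": "test"})
    event = ujson.loads(payload)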

But anyway: go with PyPy. They have binaries to download for ARM:

http://pypy.org/download.html

If you want to build PyPy from source: that is not for the faint-hearted ;) You will need a cross-build toolchain .. building PyPy on the target is theoretically possible, but will take ages. Building PyPy on my 3.4 GHz Core i7 with 12 GB RAM here takes 1.5 h.

/Tobias
