This sounds great. We're running OpenWrt with Linux 3.2 (but will soon finish migrating to 3.12).
Right now we're building with toolchain-arm_v7-a_gcc-4.6-linaro_eglibc-2.15_eabi
and these options
CFLAGS="-Os -pipe -march=armv7-a -mtune=cortex-a8 -fno-caller-saves -mfpu=neon -mfloat-abi=hard -fhonour-copts -Wno-error=unused-but-set-variable -fpic -DSQLITE_ENABLE_UNLOCK_NOTIFY "
I have on my list to experiment with PyPy and also wsaccel. I just need to set up some isolated test runners to be able to compare the results. But my plan is to first figure out exactly what's causing the high CPU load.
As I mentioned before, on our 600 MHz CPU a simple WS echo server sending ~10 bytes at around 25 Hz to 1 client eats an unreasonable amount of CPU. When we add a few more subscribers to the same data, it sometimes feels like a queue is building up somewhere, adding latency to the mix.
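For reference, the scenario is roughly what the sketch below approximates (a minimal example, not our production code; the import path follows current Autobahn|Python, older releases used autobahn.websocket instead):

    # Minimal approximation of the test scenario: push a ~10 byte
    # message at 25 Hz to every connected WebSocket client.
    from twisted.internet import reactor, task
    from autobahn.twisted.websocket import WebSocketServerFactory, \
        WebSocketServerProtocol

    class PushProtocol(WebSocketServerProtocol):

        def onOpen(self):
            self.factory.clients.append(self)

        def onClose(self, wasClean, code, reason):
            if self in self.factory.clients:
                self.factory.clients.remove(self)

    factory = WebSocketServerFactory(u"ws://127.0.0.1:9000")
    factory.protocol = PushProtocol
    factory.clients = []

    def push():
        for client in factory.clients:
            client.sendMessage(b"0123456789", isBinary=True)  # ~10 bytes

    task.LoopingCall(push).start(1.0 / 25.0)  # 25 Hz
    reactor.listenTCP(9000, factory)
    reactor.run()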
Let me know if you have some ideas on how to profile and get some stats out to nail down the first couple of bottlenecks.
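One low-tech starting point (a sketch, CPython only; PyPy would need a different profiler) is to run the whole reactor under cProfile and dump the top offenders on shutdown:

    # Wrap the relay server's reactor.run() in cProfile and print
    # cumulative timings once the reactor stops (e.g. on Ctrl-C).
    import cProfile
    import pstats

    from twisted.internet import reactor

    profiler = cProfile.Profile()
    profiler.enable()
    try:
        reactor.run()
    finally:
        profiler.disable()
        pstats.Stats(profiler).sort_stats("cumulative").print_stats(25)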
On 27.11.2013 at 14:52, David Eriksson wrote:
Half a year later we're back on this. We just finished setting up some test runners to simulate various incoming traffic scenarios through a simple Twisted/Autobahn relay server, to do some isolated profiling.
Our target is an ARM platform (AM335x), so basically no floats except via the NEON co-processor if the correct build flags are on.
I saw your new project Crossbar and figured maybe you've been digging deeper into getting Autobahn to run on embedded platforms such as the Raspberry Pi, BeagleBone, etc.
We are indeed using Autobahn and Crossbar on the Pi (and devices of similar capability). Lately, I've been looking into running on even less capable devices (Arduino Yun) .. see that other currently running thread / conversation with Peter. That is a 32 bit MIPS CPU with less steam and RAM than the Pi.
Whether ARM soft-float, standard hardware float or NEON float makes a difference to Autobahn, I haven't analyzed. I tend to think it would be surprising if it did. But I don't know.
So the AM335x is a 32-bit ARM with MMU and NEON, but without "standard" hardware float? What compile flags do you use? GCC? Which version?
I.e. on x86-64, SSE-based float is actually the default for how GCC and Clang compile nowadays. Using the "old" x87 unit is actually discouraged unless you need the 80-bit FP precision (and not the IEEE-standard 64-bit that SSE float provides).
What OS are you running? Do you build your CPython yourself? How? Can you build PyPy?
Sorry, lots of questions;)
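To answer the build questions from the Python side, something like this reports what the running interpreter was compiled with (a sketch for CPython; PyPy may not expose the same sysconfig variables):

    # Print compiler, CFLAGS and target triple of the running CPython,
    # which answers the GCC version / compile flag questions without
    # digging through the OpenWrt toolchain config.
    import platform
    import sysconfig

    print(platform.python_compiler())                  # e.g. "GCC 4.6.3"
    print(sysconfig.get_config_var("CFLAGS"))          # build CFLAGS
    print(sysconfig.get_config_var("HOST_GNU_TYPE"))   # e.g. "arm-...-gnueabi"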
I'd really like to get this Python setup tight and am confident it's possible to send 10 bytes to 30 clients at 50 Hz (that's only 1,500 messages and ~15 kB of payload per second) without a problem. But something is eating all our CPU, and outgoing traffic seems to be queued somewhere.
I'm actually a hardware guy, but I can hear my colleagues voting for rewriting the WebSocket stuff in C++. I'm not so sure. Let me know if you have some ideas on how to make this blazing fast.
Well, I'd take the challenge of beating C++ if you put a trophy on the table;) I'd take it since I lately did extensive profiling and experiments pitting multi-core Autobahn against Netty on Java HotSpot and against WebSocket++/ASIO/Boost. I am confident we can compete / surpass. Let me just throw in a couple of links and info bits (the whole story is much longer, and I am still preparing a blog post with all the hard numbers etc):
We are now using PyPy (or CPy+wsaccel) to run Autobahn-based stuff.
PyPy is a tracing JIT with an awesome GC that beats Java Hotspot in my measurements of latency (max, 95% quantiles etc). It produces machine code close to native C/C++.
Btw: PyPy is now on the Pi also: http://www.raspberrypi.org/archives/3881
For CPython, we have wsaccel, a native C code accelerator which is open-source.
Autobahn on PyPy and CPy+wsaccel is very fast.
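A quick sanity check (my suggestion, not an official Autobahn recipe) before blaming pure-Python hot loops is to confirm wsaccel is actually importable on the target:

    # If this fails, the C accelerator isn't installed and the
    # pure-Python UTF8 validation / XOR masking paths run instead.
    try:
        import wsaccel
        print("wsaccel available:", getattr(wsaccel, "__version__", "unknown"))
    except ImportError:
        print("wsaccel missing - running pure Python code paths")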
Still, I wasn't satisfied (huh, only somewhat faster than Java on HotSpot, a VM with hundreds of man-years of effort put into it? Not good enough).
So I sat down and wrote an accelerator (for hot code paths in WS) in SSE2/SSSE3/AVX assembly (actually C SIMD intrinsics):
"Uberschall - A networking accelerator library"
This is hand-optimized, vectorized code that runs on 256/512-bit registers. It can process raw WebSocket at multiple GB/s, on a 4-year-old Nehalem Xeon. I don't have the physical hardware, but I guess you could saturate _multiple_ 10GbE NICs on current Xeons. It's faster than anything I ever measured in the WebSocket land. Forget _pure_ C/C++ - you won't have a chance against SIMD-vectorized assembly;) And that's actually the perfect thing (in my view): going bare-metal for the really hot loops, and having a high-level, secure, dynamic language like Python for the rest. And remember: on PyPy, even the latter is (often) JITted to native (scalar) machine code.
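For context, the hottest of those paths is just RFC 6455 frame masking: XOR every payload byte with a rotating 4-byte key. A naive scalar version looks like the sketch below; it is exactly this loop that wsaccel (plain C) and Uberschall (SIMD) replace with word- and vector-wide operations:

    # Naive byte-at-a-time WebSocket (un)masking per RFC 6455.
    # Masking is its own inverse: applying the same key twice
    # restores the original payload.
    def unmask(payload, masking_key):
        return bytes(b ^ masking_key[i % 4] for i, b in enumerate(payload))

    key = b"\x12\x34\x56\x78"
    assert unmask(unmask(b"hello ws", key), key) == b"hello ws"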
Anyway, the above benchmarking was all done on Xeon .. not low-power ARM. My issue of "work queue full" still persists till today;)
On 25 Apr 2013, at 14:58, Tobias Oberstein <tobias.o...@gmail.com> wrote:
But for the 8 vs. 13, there is still no difference on the receiver side in Autobahn, right?
Performance-wise, have you done any benchmarks to compare the two!?
There should be no difference between 8 and 13 performance-wise whatsoever. I haven't done specific benchmarking though.
I'd suggest trying the following:
- don't use WSS, but WS
- use binary WS messages
Will do. But some browsers may still run non-binary protocols, so we still need to support the worst case.
- disable UTF8 validation (setProtocolOptions.utf8validateIncoming = False)
- disable frame masking (setProtocolOptions.maskClientFrames = False,
  setProtocolOptions.requireMaskedClientFrames = False)
ok, will try this as well.
Could you please rerun your profiling / CPU load measurements with the above?
That is: epoll, no WSS, UTF8 validation disabled, frame masking disabled.
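Spelled out in code (a sketch; server_factory / client_factory stand in for your own WebSocketServerFactory / WebSocketClientFactory instances):

    # Install the epoll reactor before anything imports the default one.
    from twisted.internet import epollreactor
    epollreactor.install()

    # Plain WS, no WSS: listen via listenTCP, not listenSSL.

    # Server side: skip UTF8 validation, accept unmasked client frames.
    server_factory.setProtocolOptions(utf8validateIncoming=False,
                                      requireMaskedClientFrames=False)

    # Client side: don't mask outgoing frames in the first place.
    client_factory.setProtocolOptions(maskClientFrames=False)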
The results should allow us to nail down the origin of the high CPU load ..
Sorry, I currently have no time to replicate what you are doing here. I'd love to do, but I am just drowning in stuff ..