Using Twitter on our WebKit is not comfortable because it's slow, so the next step is to make it faster.
How can we find the bottleneck?
I did a binary search over the code path, timing sections with the nowInMsec() API and rdtsc() to narrow down where the bottleneck is. In the end, I found that most of the time is spent in memcpy: received data is copied into another buffer to merge the pieces.
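The timing approach can be sketched roughly like this. It is only an illustration: now_in_usec(), probe_t, probe_begin() and probe_end() are hypothetical names, and the clock here is standard C11 timespec_get() standing in for nowInMsec() and rdtsc(), which are specific to our platform.

```c
#include <assert.h>
#include <stdint.h>
#include <time.h>

/* Hypothetical stand-in for nowInMsec(): wall-clock time in microseconds.
 * Uses C11 timespec_get(); a real port would use its own clock or rdtsc. */
static uint64_t now_in_usec(void)
{
    struct timespec ts;
    timespec_get(&ts, TIME_UTC);
    return (uint64_t)ts.tv_sec * 1000000u + (uint64_t)ts.tv_nsec / 1000u;
}

/* Accumulates the time spent in one suspected region of code. */
typedef struct {
    const char *name;     /* label for the region */
    uint64_t total_usec;  /* time accumulated over all calls */
    uint64_t calls;       /* how many times the region ran */
    uint64_t started_at;  /* timestamp taken by probe_begin() */
} probe_t;

static void probe_begin(probe_t *p)
{
    assert(p != NULL);
    p->started_at = now_in_usec();
}

static void probe_end(probe_t *p)
{
    assert(p != NULL);
    p->total_usec += now_in_usec() - p->started_at;
    p->calls++;
}
```

Wrap one half of the suspicious path in probe_begin()/probe_end(), compare total_usec against the other half, then move the probes into whichever half is slower and repeat.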
What should we do next?
We should think about the following things.
- Is our memcpy fast enough?
- Can we reduce the number of memcpy calls?
Is our memcpy fast enough?
Yes. We use a memcpy written in assembly, borrowed from newlib. It uses rep movsl, and it's as fast as __builtin_memcpy.
One aggressive option is to use an SSE-optimized memcpy. Should I try it? I miss Google Code Search; I can't find an SSE-optimized memcpy for GCC.
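For reference, a minimal sketch of what an SSE copy loop could look like with GCC's SSE2 intrinsics from <emmintrin.h>. This is my own illustration, not code taken from newlib or anywhere else, and sse_memcpy_sketch is a hypothetical name. It uses unaligned loads and stores and makes no claim of being faster than rep movsl; that would have to be measured.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#if defined(__SSE2__)
#include <emmintrin.h>  /* SSE2 intrinsics */
#endif

/* Hypothetical sketch: copy n bytes, 16 at a time where SSE2 is available.
 * Like memcpy, the source and destination must not overlap. */
static void *sse_memcpy_sketch(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;

    assert(n == 0 || (dst != NULL && src != NULL));

#if defined(__SSE2__)
    /* Main loop: one unaligned 128-bit load and store per iteration. */
    while (n >= 16) {
        __m128i chunk = _mm_loadu_si128((const __m128i *)s);
        _mm_storeu_si128((__m128i *)d, chunk);
        s += 16;
        d += 16;
        n -= 16;
    }
#endif

    /* Tail (or the whole copy when SSE2 is not available): byte by byte. */
    while (n > 0) {
        *d++ = *s++;
        n--;
    }
    return dst;
}
```

A tuned version would probably align the destination first and use aligned or non-temporal stores for large copies, but that is beyond this sketch.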
Can we reduce the number of memcpy calls?
Yes, technically. But I don't want to change the lwip code.
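One way to cut the copies without touching lwip, assuming the code that consumes the data can accept it in pieces, is to remember the received chunks instead of merging them. The sketch below is only that idea in miniature: chunk_t, chunk_list_t and both functions are hypothetical names, and the caller has to keep every chunk alive while the list refers to it.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_CHUNKS 64

/* One received piece of data; the list does not own or copy it. */
typedef struct {
    const unsigned char *data;
    size_t len;
} chunk_t;

/* Received pieces in arrival order, kept as pointers instead of merged. */
typedef struct {
    chunk_t chunks[MAX_CHUNKS];
    size_t count;
    size_t total_len;
} chunk_list_t;

/* Record a chunk without copying it. Returns 0 on success, -1 if full. */
static int chunk_list_append(chunk_list_t *list, const void *data, size_t len)
{
    assert(list != NULL);
    if (list->count == MAX_CHUNKS)
        return -1;
    list->chunks[list->count].data = data;
    list->chunks[list->count].len = len;
    list->count++;
    list->total_len += len;
    return 0;
}

/* Read the byte at a logical offset across all chunks, or -1 if out of range. */
static int chunk_list_byte_at(const chunk_list_t *list, size_t offset)
{
    assert(list != NULL);
    for (size_t i = 0; i < list->count; i++) {
        if (offset < list->chunks[i].len)
            return list->chunks[i].data[offset];
        offset -= list->chunks[i].len;
    }
    return -1;
}
```

If the consumer really needs one contiguous buffer, total_len at least lets it allocate once and copy each chunk exactly once.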
I guess another thing we should consider is the timer. WebKit, curl and lwip all use timers.
WebKit uses one for event handling, curl for timeout handling, and lwip for TCP retransmission.
If the timer function is not good enough, it hurts performance.
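A quick way to check whether a clock is "good enough" is to measure the smallest step it can report. The sketch below does that against C11 timespec_get() purely as a placeholder; clock_now_nsec() and smallest_clock_step_nsec() are hypothetical names, and on our platform the body of clock_now_nsec() would be swapped for the real timer function.

```c
#include <assert.h>
#include <stdint.h>
#include <time.h>

/* Placeholder clock: replace the body with the platform's timer function. */
static int64_t clock_now_nsec(void)
{
    struct timespec ts;
    timespec_get(&ts, TIME_UTC);
    return (int64_t)ts.tv_sec * 1000000000 + ts.tv_nsec;
}

/* Spin until the clock value changes, several times over, and return the
 * smallest positive step seen. A large result means a coarse timer. */
static int64_t smallest_clock_step_nsec(int samples)
{
    assert(samples > 0);
    int64_t best = INT64_MAX;
    for (int i = 0; i < samples; i++) {
        int64_t start = clock_now_nsec();
        int64_t now;
        do {
            now = clock_now_nsec();
        } while (now == start);
        int64_t step = now - start;
        if (step > 0 && step < best)
            best = step;
    }
    return best;
}
```

If the step comes out at, say, 10 ms, then every timeout and retransmission timer built on that clock is rounded to 10 ms at best.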