All code is available on my GitHub repo.

python-vs-go. Credit to https://bitfieldconsulting.com/posts/go-vs-python

Image borrowed from the awesome Bitfield Consulting

Introduction

Python is a great language and its ML libraries are second to none, but between its type system and its dependency environment I found it very hard to write safe code that I can be sure about. Generating a Docker image below 1.5 GB is also almost impossible.

The more I worked with it, the more I missed Go for its compiled artefacts, its performance and especially its multiprocessing capabilities.

“beautiful” code vs efficient concurrency. https://devopedia.org/go-language

Graph of beautiful code vs efficiency of concurrency. Seen at devopedia

Go has been marketed as a way to write beautiful concurrent code, and it does have an amazing syntax for concurrent programming. Developer experience aside, I want to test how it compares to Python in single-threaded and multi-threaded benchmarks.

Notes:

All tests ran on a Ryzen 5 3600 with 32 GB of RAM (4x8 GB, 2666 MHz) using Python 3.12.7 and Go 1.23.2 on Arch Linux.

All benchmark times come from running perf stat, and time means total elapsed time unless stated otherwise.

CPU intensive test:

Function definition:

I’ll use François Viète’s pi series as shared by CodeDrome. There is no practical point in iterating as many times as we are going to, but it sure uses a lot of CPU when iterated enough times.

import math


def calcPi(iterations: int) -> float:
    numerator = 0.0
    pi = 1.0

    for _ in range(1, iterations + 1):
        numerator = math.sqrt(2.0 + numerator)
        pi *= numerator / 2.0

    return (1.0 / pi) * 2.0

Adapted to Go it looks like:

func calculatePi(nIter int64) float64 {
    numerator := 0.0
    pi := 1.0

    for n := int64(0); n < nIter; n++ {
        numerator = math.Sqrt(2.0 + numerator)
        pi = pi * (numerator / 2.0)
    }

    return (1.0 / pi) * 2.0
}
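To make the "no practical point" remark concrete, here is a small sketch of how quickly the series converges. The name `viete_pi` is mine, not from the repo; the loop is the same as in `calcPi`. In double precision the result stops improving after a few dozen iterations, so everything beyond that is pure CPU load.

```python
import math


def viete_pi(iterations: int) -> float:
    """Approximate pi with Viete's infinite product (same loop as calcPi)."""
    numerator = 0.0
    product = 1.0
    for _ in range(iterations):
        numerator = math.sqrt(2.0 + numerator)
        product *= numerator / 2.0
    return 2.0 / product


# Print the absolute error for a few small iteration counts.
for n in (1, 5, 10, 20, 30):
    print(n, abs(viete_pi(n) - math.pi))
```

The error shrinks by roughly a factor of four per iteration, which is why the millions of iterations used in the benchmarks only serve to burn CPU time.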

Finding a comparable single threaded task:

Before the multiprocess comparison we need to benchmark the function single-threaded. The point is to find the number of iterations that results in more or less the same CPU time for each language.

I’ve targeted 2s so that benchmarks are comfortable to run but it’s still easy to feel the difference.
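A rough way to find such an iteration count is to time a small probe run and scale linearly, since the loop does constant work per iteration. This is only a sketch: the helpers `scale_iterations` and `calibrate` are hypothetical names, and it measures time inside the process, whereas the numbers in this post come from perf stat and include interpreter startup.

```python
import math
import time


def calcPi(iterations: int) -> float:
    numerator = 0.0
    pi = 1.0
    for _ in range(1, iterations + 1):
        numerator = math.sqrt(2.0 + numerator)
        pi *= numerator / 2.0
    return (1.0 / pi) * 2.0


def scale_iterations(probe_iters: int, probe_seconds: float, target_seconds: float) -> int:
    """Scale a probe run linearly to the iteration count that should hit the target time."""
    return round(probe_iters * target_seconds / probe_seconds)


def calibrate(target_seconds: float = 2.0, probe_iters: int = 200_000) -> int:
    """Time a short probe run of calcPi and extrapolate to target_seconds."""
    start = time.perf_counter()
    calcPi(probe_iters)
    elapsed = time.perf_counter() - start
    return scale_iterations(probe_iters, elapsed, target_seconds)
```

Calling `calibrate()` gives a starting point; the final count still needs a real run to confirm, because short probes are noisy.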

I’ve included Numba, PyPy and Nuitka in the mix out of curiosity, although they will not be part of the following benchmarks.

Results:

Single thread runtime barchart

Same graph without Go to better compare Python environments:

Single thread runtime barchart python

| Runtime | # Iterations | Time (s) |
|---|---|---|
| Go | 8B | 2.11 |
| Python | 250M | 2.09 |
| Numba | 250M | 1.62 |
| Nuitka-run | 250M | 2.1 |
| Pypy | 250M | 1.6 |
| Pypy | 300M | 1.9 |

Test Conclusions:

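Go takes the lead by a lot. It comes out roughly 317 times faster: 8 billion iterations in about 2 s, against the 25.2 million iterations per 2 s task that the Python multiprocessing test below uses. If you write Go you may not need any multiprocessing at all.

Numba and PyPy gave a performance boost of roughly 30%, while Nuitka did nothing for performance.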

Multiprocessing test:

Python and Go will differ a lot in this one. While Go channels are the obvious choice for fan-out, Python is not so straightforward.

I used asyncio with an executor (loop.run_in_executor) since it’s the most modern option. It’s syntactically very different from Go, but conceptually the two are very similar.

I ran the function 50 times. It could be much higher but I wanted to keep the benchmark comfortable.

Python:

# standard-library imports
import asyncio
import contextlib
import dataclasses as dtcs
import functools as ft
import itertools as it
import math
import sys
import typing as t
from concurrent import futures

T = t.TypeVar("T")
P = t.ParamSpec("P")


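def calcPi(iterations: int) -> float:
    "Same function as in the CPU intensive test above"
    numerator = 0.0
    pi = 1.0

    for _ in range(1, iterations + 1):
        numerator = math.sqrt(2.0 + numerator)
        pi *= numerator / 2.0

    return (1.0 / pi) * 2.0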


@dtcs.dataclass
class ConcurrentAsyncTaskController:

    executor: futures.Executor
    loop: asyncio.AbstractEventLoop

    async def request(self, f: t.Callable[[], T]) -> T:
        try:
            return await self.loop.run_in_executor(self.executor, f)
        except RuntimeError as e:
            print("Runtime error, either not in event loop or closed executor")
            raise e


@contextlib.contextmanager
def processContext(max_workers: int):
    loop = asyncio.get_running_loop()
    with futures.ProcessPoolExecutor(max_workers=max_workers) as p:
        yield ConcurrentAsyncTaskController(p, loop)


async def main():
    N_MULTIPROC = int(sys.argv[1])
    n_iter = 25200000
    N = 50

    ls = it.repeat(n_iter, N)
    fs = map(ft.partial, it.repeat(calcPi), ls)

    with processContext(N_MULTIPROC) as ctrl:
        async with asyncio.TaskGroup() as tg:
            coros = map(ctrl.request, fs)
            list(map(tg.create_task, coros))


if __name__ == "__main__":
    asyncio.run(main())

This code not only uses async but map/iterators as well to keep the code concise and readable. I personally favor this kind of syntax because I find it easy to test and easy to write memory efficient code with.
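The `map(ft.partial, it.repeat(...), ...)` line is the least obvious part of that listing, so here is the idiom on its own. The function `square` is just a stand-in for `calcPi`; the point is that `map` pairs the function with each argument lazily and yields zero-argument callables, which is the shape `ConcurrentAsyncTaskController.request` takes.

```python
import functools as ft
import itertools as it


def square(x: int) -> int:
    return x * x


# Pair the same function with each argument; nothing is called yet.
# map stops as soon as the shortest iterable (the argument list) runs out.
thunks = map(ft.partial, it.repeat(square), [1, 2, 3])

# Each item is a zero-argument callable that runs square(arg) when invoked.
results = [thunk() for thunk in thunks]
print(results)  # [1, 4, 9]
```

Because everything stays an iterator until the final `list(...)`, no intermediate list of 50 partials is built up front.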

Go is rather different:


package main

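import (
    "math"
    "os"
    "strconv"
    "sync"
)

// calculates pi iteratively through a number of iterations
// (same function as in the CPU intensive test above)
func calculatePi(nIter int64) float64 {
    numerator := 0.0
    pi := 1.0

    for n := int64(0); n < nIter; n++ {
        numerator = math.Sqrt(2.0 + numerator)
        pi = pi * (numerator / 2.0)
    }

    return (1.0 / pi) * 2.0
}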

func main(){
    nProc64, err := strconv.ParseInt(os.Args[1], 10, 64)
    if err != nil{
        panic(err)
    }

    N := 50
    chArg := make(chan int64)

    var wg sync.WaitGroup
    wg.Add(int(nProc64))

    // PRODUCER
    go func(){
        var nIter int64 = 8000000000
        n := 0
        for n < N{
            n++
            chArg <- nIter
        }
        close(chArg)
    }()

    // WORKERS
    for i:=0;i< int(nProc64); i++{
        go func(){
            for arg := range chArg{
                calculatePi(arg)
            }
            wg.Done()
        }()
    }
    wg.Wait()
}

In Go there are not that many options. This code uses one channel to distribute the arguments to the workers. Once all 50 arguments are dispatched, the channel is closed. The main goroutine waits for all workers to finish through a wait group.

It follows the patterns explained in the official Go blog.

Results

Multiprocess scalability

Dotted lines show improvement while solid lines show raw time.

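| # Proc | Python Time (s) | Go Time (s) | Go Improv. (%) | Python Improv. (%) |
|---|---|---|---|---|
| 1 | 100.77 | 102.75 | N/A | N/A |
| 2 | 51.44 | 51.85 | 49.5377 | 48.9531 |
| 3 | 35.21 | 35.4 | 31.7261 | 31.5513 |
| 4 | 27.02 | 27.19 | 23.1921 | 23.2604 |
| 5 | 21.71 | 21.47 | 21.0371 | 19.6521 |
| 6 | 21.96 | 22.07 | -2.7946 | -1.15154 |
| 7 | 20.67 | 21.78 | 1.314 | 5.87432 |
| 8 | 20.55 | 19.97 | 8.31038 | 0.580552 |
| 9 | 19.93 | 20.38 | -2.05308 | 3.01703 |
| 10 | 20.61 | 20 | 1.86457 | -3.41194 |
| 11 | 20.29 | 19.7 | 1.5 | 1.55264 |
| 12 | 20.68 | 20.56 | -4.36548 | -1.92213 |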
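The improvement columns are not defined in the text, but the numbers match the relative time saved compared to the previous row (one process fewer). That reading is my inference from the data, and `step_improvements` is a name made up for this sketch:

```python
from typing import List, Optional


def step_improvements(times: List[float]) -> List[Optional[float]]:
    """Percentage of time saved by each row relative to the row before it."""
    out: List[Optional[float]] = [None]  # no previous row for 1 process
    for prev, cur in zip(times, times[1:]):
        out.append((prev - cur) / prev * 100.0)
    return out


# Go times for 1 to 6 processes, taken from the table above.
go_times = [102.75, 51.85, 35.4, 27.19, 21.47, 22.07]
for n_proc, gain in enumerate(step_improvements(go_times), start=1):
    print(n_proc, "N/A" if gain is None else round(gain, 4))
```

A negative value, as in the row for 6 processes, simply means that run was slower than the one before it.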

Test conclusions

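In terms of scaling it’s 100% a tie: both languages speed up almost identically as processes are added, so Python’s process pool has virtually no overhead here.

Keep in mind that my machine has 12 threads but only 6 cores. This may help explain why there is virtually no improvement past 5 parallel processes in this example.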
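The plateau is easier to see as speedup over the single-process run and as efficiency (speedup divided by the number of processes). This is a small sketch using a handful of the Python times from the table; the helper names are mine.

```python
# Python wall times in seconds, keyed by number of processes (from the table).
python_times = {1: 100.77, 2: 51.44, 5: 21.71, 6: 21.96, 12: 20.68}


def speedup(times: dict, n_proc: int) -> float:
    """How many times faster n_proc processes are than a single process."""
    return times[1] / times[n_proc]


def efficiency(times: dict, n_proc: int) -> float:
    """Speedup per process; 1.0 would be perfect linear scaling."""
    return speedup(times, n_proc) / n_proc


for n in sorted(python_times):
    print(n, round(speedup(python_times, n), 2), round(efficiency(python_times, n), 2))
```

Efficiency stays above 0.9 up to 5 processes and then drops sharply, because the speedup barely moves between 5 and 12 processes.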

Argument serialization test

I was surprised by the last test. Python requires every object to be pickled and un-pickled to communicate in and out of a sub-process. Go can pass values or pointers between goroutines directly, since they all share one address space.

This time I’m going to pass a very long string and each process will return its length. Go will have two implementations: one value-based (stack) and one reference-based (heap). Note that a Go string value is only a small header (pointer plus length), so even the value-based version does not copy the underlying bytes.
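To get a feel for what pickling a long string costs, this sketch measures the size of the serialized payload for two lengths. It only calls `pickle.dumps` directly; the executor’s transfer between processes is not part of it, and `pickled_size` is a name invented for the example.

```python
import pickle


def pickled_size(n: int) -> int:
    """Number of bytes pickle produces for a string of n '1' characters."""
    payload = "1" * n
    return len(pickle.dumps(payload))


# The serialized size grows with the string: every byte has to be written
# by the parent and read back by the child.
for n in (10**3, 10**6):
    print(n, pickled_size(n))
```

The payload is always at least as large as the string itself, so a 100 million character argument means at least 100 MB serialized for every task that receives it.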

Python:

# SAME IMPORTS

T = t.TypeVar("T")
P = t.ParamSpec("P")


def count_l(in_list: str) -> int:
    return len(in_list)


def gen_l_str(n: int) -> str:
    return "".join(it.repeat("1", n))


@dtcs.dataclass
class ConcurrentAsyncTaskController:
    ...  # Same code as in the previous listing


@contextlib.contextmanager
def processContext(max_workers: int):
    ...  # Same code as in the previous listing

async def main_mp():
    N_MULTIPROC = int(sys.argv[1])
    str_l = 10**8
    N = 50

    s = gen_l_str(str_l)
    fs = map(ft.partial, it.repeat(count_l, N), it.repeat(s))

    with processContext(N_MULTIPROC) as ctrl:
        async with asyncio.TaskGroup() as tg:
            coros = map(ctrl.request, fs)
            list(map(tg.create_task, coros))


def main_sp():
    str_l = 10**8
    N = 50

    s = gen_l_str(str_l)

    for _ in range(N):
        count_l(s)


if __name__ == "__main__":
    ...  # Parse args and run main_mp (through asyncio.run) or main_sp

Go:

package main

import (
    "strconv"
    "os"
    "sync"
)

func strLen(s string) int {
	return len(s)
}

func strLenRef(s *string) int {
	return len(*s)
}

func genStrL(n int) string {
	s := make([]rune, n)
	for i := range s {
		s[i] = '1'
	}
	return string(s)
}


const N = 50
const strL = 100_000_000 // 10^8 characters

func mainSp() {
	s := genStrL(strL)
	for i := 0; i < N; i++ {
		strLen(s)
	}
}

func mpVal(nProc int64) {
	s := genStrL(strL)

    chArg := make(chan string)

    // producer
    go func(){
        for i:=0; i<N; i++{
            chArg <- s
        }
        close(chArg)
    }()

    var wg sync.WaitGroup
    wg.Add(int(nProc))

    for i:=0; i<int(nProc); i++{
        go func(){
            for s := range(chArg){
                strLen(s)
            }
            wg.Done()
        }()
    }
    wg.Wait()
}

func mpRef(nProc int64) {
	s := genStrL(strL)

    chArg := make(chan *string)

    // producer
    go func(){
        for i:=0; i<N; i++{
            chArg <- &s
        }
        close(chArg)
    }()

    var wg sync.WaitGroup
    wg.Add(int(nProc))

    for i:=0; i<int(nProc); i++{
        go func(){
            for s := range(chArg){
                strLenRef(s)
            }
            wg.Done()
        }()
    }
    wg.Wait()
}

func main() {
    // Run either function (mp or sp)
}

Results

I’ve separated out the time it takes to generate the string, since it may not be trivial and should not pollute the test results.

Both cases generate a string of 100 million ‘1’s and get its length 50 times.

‘Mp’ means multi process while ‘sp’ means single process.

String multiprocess benchmark

Same image without Python multiprocess:

String multiprocess benchmark without python-mp

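| Case | Startup Time (s) | Total Time (s) | Task Time (s) |
|---|---|---|---|
| Python sp | 0.806038 | 0.884513 | 0.0784755 |
| Python mp | 0.801483 | 85.9852 | 85.1837 |
| Go sp | 0.622168 | 0.760112 | 0.137944 |
| Go mp value | 0.605647 | 0.773528 | 0.167881 |
| Go mp reference | 0.62473 | 0.728804 | 0.104074 |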

Test Conclusions

Here we can see the weak point of Python multiprocessing. Note that objects are pickled just the same when using queues. If you have to pass large in-memory objects between processes, don’t use Python multiprocessing.

I’ve left the number of processes out of this table since it does not affect the time. I scaled up to 50 processes.

Even more interesting, memory usage doesn’t seem to scale with the number of processes. Profiling memory is much harder than profiling time, but just by checking htop you can see the spike in memory, and it clearly does not scale with the number of processes.

I would like to test whether pickle is caching the results in some way. That would explain how the first serialization is O(N) and O(1) after that…
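A first, very rough probe of the caching idea could time `pickle.dumps` on the same string several times in a row. This is a sketch under a big assumption: it only covers serialization inside one process, not what the process pool does with the bytes afterwards, and `time_repeated_dumps` is a hypothetical helper.

```python
import pickle
import time
from typing import List


def time_repeated_dumps(payload: str, repeats: int = 5) -> List[float]:
    """Return the wall time in seconds of each consecutive pickle.dumps call."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        pickle.dumps(payload)
        timings.append(time.perf_counter() - start)
    return timings


# If results were cached per object, later calls should be far cheaper
# than the first one.
timings = time_repeated_dumps("1" * 10**6)
print([round(t, 6) for t in timings])
```

If all timings come out similar, caching inside `pickle.dumps` is unlikely to be the explanation and the memory behaviour has to come from somewhere else.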

Memory usage is comparable between the two implementations just by looking at htop.

This test is rather rudimentary. I’d love to follow it up with more practical examples.

Conclusion:

I was surprised twice when performing these tests. First when I saw how abysmal the single-threaded performance of Python is, and then again when I found out how good Python is at taking advantage of the multiple cores of the machine.

Numba and PyPy proved to be worthy optimizations, but they are very far behind what a compiled language can do.

Python multiprocess executors are good for small-memory, CPU-heavy tasks but not for big-memory tasks. Python is not suitable for high-performance single-threaded applications (not even with Numba).

It would be interesting to test other tools related to Python such as Mojo or Taichi. These are full-blown programming languages so you cannot use the Python ecosystem. I find it easy to switch to Go (or Rust for that matter).

As machines scale to more and more cores, Python will be a viable choice for somewhat CPU-intensive applications where latency is not critical.