@@firstname4337 Yes, because if you use numpy you use C. That is the whole point of numpy, to not use python for operations that are repeated millions of times. Alex' point stands, because in the numpy case we use Python for everything that is not time sensitive, but for the thing that is time sensitive we use a small C monster wrapped in a thin fluffy Python blanket.
@@HiltownJoe Have you heard of PyPy. I ran his source file in pypy3 and got less than a quarter of a second for all methods except sum generator and sum list comp (0.72 seconds for those)
Clearly not a protip. A professional understands the constraints of their environment and the project they're working on. The choice of programming language is not (always) up to the individual writing the application.
I'm a physicist and my experience is that when you spend the majority of your time writing the code, and running it takes only a few seconds, reducing the development time is much more important than reducing the run-time, and that's why I like Python.
Hi Peter, well I've been a software developer delievering a (Basic) scriptable technical measurement data PC-application NI DIAdem, and we put most of our development effort into giving our users like you (engineers & physicists) optimized script commands, which behave similar to that of Python. So using already implemented loop command may give you better performance than own loops as described here.
that's the thing tho, they're not mutually exclusive, you can have a very easy to write and a decently performant language (look at julia) why python is slow is not bc it can't go faster or bc it's a sacrifice for expressiveness (as there are python implementations that are way way faster but that lack total compatibility with libs) The reason why python is slow is because it's poorly designed. It uses linked lists for lists instead of vectors (log of n vs log of 1), it compiles to an intermediate language but it leaves it at that, there's no JIT in Cpython, the global interpreter lock impedes people from writing super performant code, etc, etc Python is slow bc of the choices they've made over the years, they're consistently made choices that affect performance, they never took the time to think about it and see how they could implement abstractions that would not take a big toll on performance, they just added things on top as they saw fit. But that's not a big surprise once you see that python was a pet project of Guido and he just made it bc he could and people happened to pick it up and spread it (even tho it's performance and features were worse back then)
You're using Python for simpler scripts, like what it was made for. That's a valid use case. Making large systems in Python is just stupid for the simple reason that it will be expensive to run.
I'm in my 3rd year in Computer Science and trust me, I've watched SO many youtubers that do this and you're by far the smallest for an unknown reason. You explain things really well, you don't beat around the bush and overall just make quality content. :D I'm now subbed and I'll be coming back for more :D
Fun fact: A good C compiler would detect this common pattern (summing sequential numbers) and replace the for loop with the direct calculation of the result. And if even the number of elements was hard-coded, the compiler would even simply replace the whole code with the hard-coded numeric answer.
I always wondered what's the point of this? Are series we have summation formulas for really coming up in a real software? As far as I can tell the only application is to show it in a presentation about optimizing compilers. And frankly, this is even a wrong showcase of optimization, cause the useful techniques are general and pliable enough for compiler to reason about. This is just a hardcoded nonsense.
@@Czeckie this exact problem isn't common. But good compilers can recognize loads of (small) patterns in a codebase that, when combined, can amount to a large speedup. Sometimes some optimized chunks can again be combined into a new, smaller and/or faster, chunk.
@@oj0024 Yes, if you write shitty enough code the compiler cannot fix it for you because it can’t make assumptions about what you’re trying to accomplish. This is not a revelation or even a shortcoming, its a feature.
@Matt Murphy He is right that compilers can do it, but if we look at the two most modern compilers right now only one of them can detect it and only if you write the code in a non-obvious way. The second part showed that compilers don't do it with constants either, they just execute the code internally. I'm not saying that compilers aren't doing great optimizations, just that the claim that "A good C compiler would detect this common pattern" is wrong. Especially since such optimizations don't gain anything (I'm talking about the pattern transformation, not the constant evaluation). They can only be done in very few cases, and you usually assume that the programmers know what they are doing. (e.g. the only reason to implement the summation code would probably be for benchmarking purposes)
@@joestevenson5568 clang didn't detect the pattern when using a normal for loop, it only detected it if you write the non obvious and usually not used while(n--) loop.
5:08 It's not about NumPy being primarily written in C (so is CPython), but about Python' dynamic vs. NumPy's static typing. In a dynamically typed language like Python, every time the interpreter encounters an expression like c = a + b, it will first have to check a's type (and find out it's an integer). Then it checks b's type (and finds out it's an integer too). Then it checks if there's any known operation to add (+) two integer types (of course there is). If you do this one million times in a row (e.g. in a loop), the same checks will be performed over and over again. It's like asking "Is a still an integer? ... What about b? ... Do I know how to add two integers?" one million times in a row. And of course all of that happens at runtime. That's why Python is so damn slow in that regard.
Yeah... that's the price you pay for flexibility. But there are ways to overcome this. They all require some kind of hinting that the structures you use contain only elements of the same type.
> In a dynamically typed language like Python, every time the interpreter encounters an expression like c = a + b, it will first have to check a's type (and find out it's an integer). Then it checks b's type (and finds out it's an integer too). Then it checks if there's any known operation to add (+) two integer types (of course there is). A dynamically typed language implementation doesn't *have* to do things this way, that's just the naïve approach (albeit the easiest approach out of the box for new implementations). A smarter specializing JIT can create precompiled versions of functions and expressions for commonly used types (e.g. integers), and can determine that they are invariant and avoid having to re-check the types each time you hit the expression. The default CPython implementation does things the way you describe, but alternate implementations like psyco and PyPy take the specializing approach in their JITs. Not surprisingly, PyPy executes these for and while loops on the order of 50-100 times faster than CPython does on my machine (and ~10 times faster than numpy). $ python3 benchmark.py while: 5.254076757992152 for: 2.9640753349813167 sum_range: 1.1016407490242273 sum_numpy: 0.6697274879843462 $ pypy3 benchmark.py while: 0.05818561598425731 for: 0.07444041498820297 sum_range: 0.07667819698690437
I hope this problem can be partly solved if people declare data types of nearly all the variables as they need to do in C before hand like I do most of the time
@@sumnerhayes3411 There are tons of other reasons why dynamic typing is pointless and bad. It saves you having to *gasp* tell the program the type of a variable one single time; and it costs you dearly. All python projects should enforce static typing, even though that doesn't gain performance. All languages should be statically typed, period, because it is infinitely more readable and provides no development benefits at all.
@@kylehart8829 I would have disagreed prior to learning Python, back when it was just dogma in my head. Interesting how that works. However, the learning process was made easier by removing a step. I think a better solution would be to make it extremely clear how to toggle Python into enforcing static typing, and why that's a huge benefit.
Sometimes it’s just impossible to express something in numpy - iterative algorithms, for instance. I had a very hot piece of code, which implemented an A* pathfinding algorithm, which was slowing my entire program down. I looked for solutions and found a package called Numba - it allows you to decorate a single function and it would try to compile it the best it can. There are various optimization settings to consider, and in order to get *significant* improvements in performance (like, order of magnitude and more) you basically had to rewrite your program as if you were writing in C, but using Python’s syntax - so, no dynamic typing, using very limited subset of built-in types and refrain from using custom classes as much and possible. But it didn’t require too much work, and it was a great benefit in the end - I ended up with a blazing fast function without needing to write it in actual C (not very hard) and then figure out how to couple this with Python (quite finicky if you’ve never done it). So yeah, I’d recommend looking at Numba if you need a more general solution.
@@NJ-wb1cz Eh, rust is likely one of the hardest languages to learn (after "modern" C++ with template metaprogramming). Julia seems quite easy to learn and pretty fast (if strongly typed it's close to C, maybe twice or three times slower, but not two orders of magnitude like Python). On the other hand I'm not sure how to structure code which is too complex for a single file and too simple to put it into package. It was basically developed for data crunching. Also year or two ago Rust was lacking some libraries, especially for plotting graphs (C++ lacks them too), painting (Cairo) and even exporting data as images (no support for writting of 16bit PNG files)
@@NJ-wb1cz right, who would do something as stupid as just adding another library as dependency when they could add an entirely different development stack just to optimise one function.
@@julians.2597 why is adding libraries stupid? That's a strange stance. Sure, moving to another language is an option, but you'll likely use libraries there as well.
Very good video. Note: NumPy isn't just C, it's highly optimised C with SIMD capabilities. The only way you can go faster than that is with GPGPU like OpenCL, CUDA, Vulkan or Metal. (And even that isn't assured since GPGPU has initial overhead.)
Half through the video I thought: actually for this simple sum you don't even need to loop just the formula. And you didn't disappoint. While this might seem like a very synthetic example which won't often be possible in real life, you'd be surprised how many times I (a programmer who is not even very good at maths) have been able to replace large loops and delete large chunks of code and replace it with a simple formula. Not usually this clean and simple, but quite often you find that the code is looping through a lot of data calculating stuff and then filtering out stuff, and if you simply rearrange things you might be able to do the filtering before the loop which might even reduce the loop to a few statements without a loop. Very often this is not obvious in the code but very obvious when sketching out the desired results on paper. Pencil and paper is definitely not obsolete.
@@mCoding it's the moments when you're lying in bed and it kinda hits you out of nowhere that are the best Of course, it is kinda a let down when the divine revelations are actually wrong though
You can actually be even faster than with math, if you know how CPUs work. If you instead of calculating (n * (n -1)) // 2 just calculate (n * (n -1)) >> 1 it should (in theory) be a few clockcycles faster. While most math operations are pretty trivial for the CPU(+, -, *) division is a pretty annoying task, so you might spare some time with just bitshifting.
@@delqyrus2619 Bit shifting is math too. No different than multiplying and dividing by powers of ten being easy in decimal. A good compiler will optimize it though, so it's probably best not to bother.
You want to be careful when timing functions, your processor likely speeds up or down based on system load and ambient temperature changes. Although you can still capture order of magnitude changes very easily like you did. If you're on a laptop though where cooling and power are limited, take extra care, and run multiple simulations in different orders at random times!
One thing you can do is to disable CPU boost clocks during benchbarks. This way, you get stable clocks regardless of load created by the tested function. For example on Linux, you can just write 0 to /sys/devices/system/cpu/cpufreq/boost to disable the boost or write 1 to enable it again. This’ll get you much more stable results that are comparable even if ambient temperature changes.
@@luziferius3687 that's a good approach! Myself I usually try to do perf stat whenever possible, and count the actual CPU cycles it takes. Run each command say 3 or 5 times and take the median!
If you're using a counter within the while loop, then yes that counter is slower than the for loop. If you don't, and you use the while loop as intended and doing the same stuff inside the loop, then it is essentially the same speed. def while_loop(n=100_000_000): s = 0 while s < n: s += 1 return s takes about the same time as: def for_loop(n=100_000_000): s = 0 for i in range(n): s += 1 return s Use the right loop for the right task. If you have to put an incremental counter in your while loop then you're using the wrong loop.
Comparing the sum with the counter limit certainly won't give the correct result in s: We are calculating the sum of all integers and analyzing various ways to do that. I. e., if the limit where 10, with 1,2,3,4 we reach it, way before reaching the correct sum 55.
This was a valuable lesson to learn when I was coding a neural network. I wasn't aware of this fact and was trying to parallelise, before realising numpy automatically did that. Turning everything into vector operations led to a speed increase that was crazy.
3 роки тому+225
This is the reason why python is such a good programing language. You can write your program with easy and pretty syntax and use packages wrapping C code to do the complex and time consuming tasks. This just shows that you can use python and have fast running code. I would also add that if your program's only focus is speed and performance then don't use python, simple as that.
@@HydratedBeans unless you depend on oython-specific libraries I don't think the development speed is noticably greater compared to, say, kotlin. And a loosely typed language will always have more mistakes and harder to statically check
"The best way to loop in python is to not loop", hahaha, loved it. Also got curious about doing the numpy sum but with the built-in range (because of the memory problem), how much would that impact the performance?
In my experience, I've never hit a RAM limit because of this, but in memory constrained systems this is definitely something to watch out for. I wish numpy has something like std::ranges::iota to use as an array-like.
@@mCoding There is a library called dask, is used in distributed computing and has support for numpy arrays. working with blocks consume less memory but might be a bit slower.
i think numpy.arange(...) creates a potential memory issue (it allocates the a large array in this case), but the Python3 builtin range(...) doesn't allocate a large array - it is a lazy iterator that simply tracks the last value generated so it knows what to generate next.
There's a fun story about that summation formula. Apparently mathematician Carl Friedrich Gauss came up with it at a young age when his teacher set his class the task of adding up all the numbers from 1 to 100 and was very surprised by Gauss' solution. Regardless of the veracity of the story, the formula is indeed often referred to as the Gauss summation formula. Mind you, a lot of things in math are named after Gauss as he was extremely prolific.
@@thomas-sk3cr which actually illustrates a pitfall of using external math formulas instead of letting the computer do the work, that they are going to be less clear about what the code is supposed to do. getting the wrong answer really fast is worse than getting the right answer slowly
@@adrianmizen5070 not sure if I understand you correctly. I do think that it's definitely better to use a formula if there is one. But it should be properly documented of course and you have to make sure that it's the right one. Maybe we can agree on that? I don't think you should reimplement well-known formulas with loops just because it's more clear to the reader at first glance :)
Oh this is an amazing recommendation by youtube. I came to the same conclusion as you while having massive performance issues in my machine learning models in non obvious parts of the code. Once I noticed that it is the loops that are awfully slow I tried refactoring them as much as possible and ended up using a mixture of things u've shown here. I didn't even know or speculate that it's because of C but now everything is much clearer in my head. Thanks!
you're not supposed to do ML with for loops in theory you can and it's very undestandable if you do it that way but any solution using matrices/tensorflow will be faster. Like he says in the video numpy is fast
Oh snap! I thought I was saving on resources by not using additional modules like numpy but now I see it's ridiculously faster. I wasn't aware pre-built functions were that much better either. Thanks for the reveal!
He actually did say in the video that Numpy uses a lot more resources, and with a bigger number, you may not have enough memory. Your original thought was correct, that you do save resources by not using additional modules. You may save time, but not resources. Also, if you compile your code, those additional modules will need to be compiled in, which will make the executable larger, and require more resources as well. As we are quickly approaching the limits of how far we can go with hardware (CPU technology is pretty much capped out, unless someone figures out how to make atoms smaller), we need to be more mindful of the amount of resources we use. Programmers need to stop being lazy. They have to stop thinking that they can bloat their code with hundreds of modules because CPU's will be faster by the time their code is in use. Just because speed and bandwidth are available, doesn't mean they should be used. For example, I know of some websites that include 50-100MB of Javascript code for each page view. Sure, dial-up is a thing of the past in most parts of the world, and people have faster internet connections than ever before. The page loads pretty fast on a gigabit connection. What about the people who have much slower connections? Is it really necessary to design your website where every single page view uses an entire day's worth of bandwidth on a dial-up connection? If you have several pages of the same website open in different browser tabs, your memory utilization goes through the roof! I've seen a single browser tab on one of those websites using over 2.5GB of RAM! I haven't even checked how much unnecessary Javascript is downloaded when using something like Facebook, but I've seen a single tab with Facebook open using over _5GB RAM_!!! This kind of waste needs to stop. Even if it does save a second or two here and there, it could cost several more seconds (or longer) in other areas. I've always hated importing modules in any language I've ever programmed in. I would much rather write my own code to do what I need done, whenever possible. I realize that's not always possible, but it's always better than adding bloat to a program that's not needed. Look at what's known as the "demo scene". It's something that started LONG ago, where people try to make the most exotic program possible in assembly language that fits within a 64KB executable. People have fit 10 minute long full screen videos with music into 64KB! They have created complete first-person-shooter games in 64KB! These programs run plenty fast on hardware from over 30 years ago! If more of todays programmers took the time to write good code, we wouldn't even need computers as fast as we have. Sorry, that started off as a 1 line reply, and ended up going 100 directions. There's still soo much more I want to say about the topic, but who is going to read it anyway?
Thanks for this, I was trying to explain to people in JS that essentially always calling a native function is faster than writing a more optimized(in terms of Big-O) way of doing something. They were asking Big-O proof and didn't believe me that it's just not relevant. Looping through a string 10 times with a native regex check is faster than iterating through it once in pure JS.
It was the one addition statement executed many times that slowed the python "for" loop. The conclusion is that looping can be very fast but executing python statements anywhere in the code is very slow.
Another reason the numpy approach is faster: The elements of a numpy array must all be of the same type, and there's no overhead to check (and convert to numeric, if necessary) the type of each array element.This probably contributes much more to the increase in speed than the language difference (C vs. Python.)
In compiled languages loops are often optimized on compilation stage, such as being unrolled where necessary, with particular emphasis on efficient cache/prefetch from memory utilization. One should look on original Fortran code of LAPACK routines where some loop optimizations were done 'by hand'
@@dmitripogosian5084 loop unrolling is never efficient when cache and prefetch is the topic unless you explicitly wrote "from 0 to 4" just to avoid copy pasting code 4 times by yourself, and even that may not be worth unrolling depending on body of the for loop which most compilers can decide by themselves if you do profile guided optimization pass. There's a reason why -O2 more performant for most software than -O3 and unrolling is one culprit of that.
Its not incredebily slow, compared to others its slower yes but it can acomplish the tasks its meant for with no issues Machine learning, Automation, Data Science and Cyber Security
Great comparative video The math solution if you include n in the addition is ((n+1)*n)/2 Because you used range() and the while loop was set to < and not
Flooring the division is very much recommended, because otherwise python will convert the result to a floating point number which may cause you to lose precision
@kurogami2771 it does in this case as we are adding up a series of numbers from 0 to n, which by definition excludes negative values. As for flooring, it is true that we should avoid using floats when not needed even for just one operation so we can either floor or int() the results.
@@abebuckingham8198 edit: "either n or n-1 is even so it's never a float" is not correct because in python division with / will always result in float conversion (see type(1/1)). the rest of this reply is just trying to clarify what I now realize was not necessary to clarify in my comment I said "flooring the division is very much recommended" but what I really meant was not using the floor operator on the result but instead using the integer division operator. in python if you do a/b the result will always be a float bu if you do a//b then the result will be an integer if both a and b are as well. doing a//b is basically like doing int(a/b) except that there will be no conversion to a float and so no precision will be lost. for example, int((10**18+2)/2) evaluates to 5e17, because the division was a floating point one, but (10**18+2)//2 does result in the expected value of 5*10**17+1
Usually recursion is slower and takes up more memory. And many "big" problems cannot be coded recursively due to stack size limitations. But if compiler/interpreter can do tail recusion optimization, it's a very nice option. Unfortunately, not all problems are applicable to it.
The CPython interpreter doesn't do tail recursion so even if you write it out correctly it's still faster to loop, save the recursion for when you forget how to convert your code into a loop correctly
@Sec Coder Check out Lisp or Prolog. The recursion is conceptual in terms of the problem or model. Imagine a family tree. The definitions and relationships are recursive. Mom and Dad produce offspring each of which in turn can become Mom's or Dads who produce offspring. Comes into its own when these Relationships are highly complex ( Imagine every leaf on a real tree was in fact another tree and so on and so forth. The Mantlebrot set is recursive ) Essentially they become huge lists or recursive data structures. Those languages became prioritized and optimised around lists and recursive data structures.( With a few constructs and language primitives not found in the likes of C or Pascal such as "is a type of " "is connected to" "is related to", "is derived from" etc etc ) Pythons modelling capabilities Tuples Lists etc etc Means most of it can now at least be modelled easily before being translated to something else for speed. Ps for fun check out Occam . It's editor was indent sensitive just like Python It also let you mix sequential and parallel programming .!!!! Oh lordy 😁
@Sec Coder It lets you express common algorithms in a purely functional way, i.e. without needing to mutate variables. Doing so makes your code easier to analyze mathematically and makes it more composable.
Thanks for the tips! The comparison with NumPy is not fair because it is made for array computing and hold an unique data type. By the way, you deserve far more subscribers.
I really like your channel! These are very interesting insights into the Python language in small videos which makes them easy to digest. Keep up the good work!
Rust with 64-bit numbers looks to be about 50× faster than numpy on my box (once I convinced the compiler to stop solving the problem at compile-time without losing optimization). Rust with BigUint bignums looks to be about 5x faster than numpy. The obvious for loop with pypy3 is about 12x faster than numpy. So there's still plenty of room for speedup.
Expertly presented. I never would have thought to look at it where C was used behind the scenes. I will definitely try and make the better to worst progression habit when writing Python.
Half way through the video I thought to my self why even loop to find an answer to a simple arithmetic progression and then I was surprised that this was addressed as well. Wow, impressive.
Just wanted to note that using an iteration index in a while loop isn't really good practice anyway, since that alone would almost always mean a for loop is the correct pattern. So it's not really surprising that the while loop was slower in this experiment. I'd also like to warn people that numpy is only faster if you are operating on large sets or large arrays of numbers. If you apply a numpy function to a single number, it's almost always much slower than the equivalent function in Python, presumably due to some overhead.
My own results were 6.06 seconds for the while loop, 4.25 seconds for the for loop, and 569 ms for the numpy loop. I tried it in Julia (almost identical code except for the end tags), and the results were 1.3 ns for both the for and the while loop. Checking the lowered code it turned out it had optimised the loop away completely, so we got sum_math for free! So instead to force it to do the actual sum, I modified it to s += i * (rand() < 0.5), and it came in at 576 ms. So even with a call to rand() and a branch, Julia was still nearly 10x faster than raw Python, and was about on par with NumPy (which didn't have a call to rand() ). If I force Julia to construct the range object with collect(1:n), then I don't need so that it matches NumPy, then it gets even faster at 290 ms, but now it allocates 763 MiB of memory, so clearly there are pros and cons of using range objects, although I'd stick with the range object for most usages. So, if you like Python but you want C performance (and you like to use arrays a lot and don't like to add all the extra numpy syntax just to do that), maybe check out Julia.
Most like myself use Python for the libraries, nothing else comes close except for maybe C++. It would take a lifetime to recreate all the libraries in Julia.
@@incremental_failure there has a lot of ongoing work to make native Julia libraries (mostly in the scientific and statistical communities), but Julia can also call C, C++, and Fortran code, and the PyCall package already allows you to embed Python in Julia, so you're never left in the lurch if you choose to use Julia. I've already seen Julia be hugely faster and more flexible when doing a lot of ODE modelling than when using R and calling deSolve which is written in Fortran.
Your channel is great! I suggest you use a high pass filter for your voice to cut off the low frequencies that make your voice sound muddy, great job so far.
It would be a good idea to show examples that does a bit more work within the loop, to show that if your loop body is costly then how you loop is largely insignificant.
I like this video very much. Subscribed. Also, interestingly enough, I am taking a class about functional approach to engineering problems and the professor repeats every class to avoid looping like you would avoid the Black Plague. Lol
This is very informative and well presented. Also, it seems like you are very talented in explaining stuff, but this is only the first video of yours I have watched, so I'll have to do some further benchmarking ;) Thank you!
Numba would make it closer to the unoptimized C/C++ code. I used it to render fractals and I got it to run 50x faster than unoptimized Python, and almost the same speed as the C code. The thing is ... that you can make the loop a lot faster in C too because you can use SIMD intrinsics, so you could do 4 or 8 or even more operations in a single instruction.
Loved the video, certainly gives me some ideas for my own code optimisation. However, I think your maths is off? In the arithmetic summation formula Sn=n/2(2a(n-1)d) being simplified where a=n, you've forgotten that the common difference will be -1 instead of +1 as we're counting down, not up. This should simplify to (n*(n+1))/2 instead of (n*(n-1))/2? Tested with the numbers 1 through 10 yields 45 on your formula, 55 for the correct answer.
I think numpy also will generally, if it can, use parallel CPU operations (SIMD) to process the array, which is probably why it's faster than standard sum() (and also why it makes sense to create the array of data with arange() even if each value could be calculated in the loop. But definitely good to be aware of this memory requirement.)
@@mCoding I did a small test. from numba import njit @njit def int_sum(n=1_000_000_000): s: int = 0 for i in range(n): s += i return s timeit.timeit(int_sum, number=1) 5.373000021791086e-06 timeit.timeit(int_sum_p, number=1) 44.33714248199999 int_sum_p is the same function, without decorator. The result is very close to C speed. But obviously, the best improvement comes from doing the maths properly.
Numba is amazing, if you can limit the code in the @jit’ed routine to simple loops and math and/or numpy operations. You can’t use things like lists or dicts
Should do all the examples with numba this is also interesting. Cython also. Masterclass in squeezing loop performance, could also show sum(x for x in range(n)). Ideally use matplotlib to show the data for different sum sizes and prove the obvious that it is linear
@@gregorymorse8423 When trying to make a comparison with compiled C++ code, I discovered that clang removes the loop altogether, applying the maths that the programmer should have done in the first place. And I suspect that numba applies the same optimisation, otherwise the total time does not make sense with the clockspeed I have.
Great video. Commenting mostly to help the algorithm push this. Would have loved to see a comparison with straight C thrown in there given that C code was the source of the speedup. What's the cost of the python to C transitions?
A C compiler would probably compute the answer at compile time so there is nothing to compare! A Python to C transition would be a huge undertaking even for a small project.
Here are the results on my pc: python with for loop: 8.74s c with gcc and no compiler optimizations: 0.235734s c with gcc and full compiler optimizations: 0.048770s (gcc doesn't detect the pattern and actually executes the entire code) c with clang and full compiler optimizations: 0.000001s (clang detects the pattern and optimizes it using the Gaussian formula)
@@mCoding Nuitka will take Python code and transpile to C (which uses the Python standard Library to do the object iterations) - on some applications it is 2x faster than normal Python (and it is a work in progress).
@@mCoding Nothing in particular just any fancy pythonic things which help improve performance and code readability. Also, one thing you can do is make a mess of a code and start refactoring it. That would be amazing to see.
If you are writing pure and iterative heavy Python code, then PyPy is your best bet for performance. I ran the same test as mentioned in the video on Python 3.10 and PyPy 3.9 and my god the difference was staggering! Here are my results : Python 3.10: for loop : 6.6 secs while loop : 11 secs PyPy3.9: for loop : 0.22 secs while loop : 0.36 secs PyPy is really meant for long running code and real world results show that.
Awesome video! A few days ago I was comparing the execution time on using a loop vs. the Gaussian formula, it's nice to see some other options in between.
Very informative. Excellent insight into Py inner workings. Great incentive to employ Numpy (and Multiprocessing) whener possible. Give a simple yet splendid example of why knowing math is important for any programmer who tries to use Py for data science type applications.
I had heard about this before, thanks for sharing, it makes a lot of sense in safety critical applications. I know NASA still uses C for (as least some?) space shuttles!
@@mwanikimwaniki6801 Ada has been specifically designed to catch as many mistakes at compile time. It’s static analysis tools are extremely thorough, and in general, it’s designed to prevent undefined behavior, from its type system, to its syntax. It’s an interesting language.
If I recall correctly there was an optimisation in LLVM/Clang that was able to detect looping over elements in order to add them and inject the n(n-1)/2 directly into the assembler. :)
That is ridiculously clear, it blew me away. Straight to the point, extremely clear explanation, very methodical, and everything is shown perfectly. Just purely amazing tbh.
def recursive_loop(n=100_000_000): s = 0 def loop(i): nonlocal s s += i i += 1 return loop(i) if i < n else s return loop(0) > *RecursionError: maximum recursion depth exceeded in comparison* import sys sys.setrecursionlimit(100_000_000) > *segmentation fault (core dumped)* 😂😂😂
It actually does not affect the number of iterations. The iterator of range keeps its own internal counter that python uses to assign to i each iteration.
It's also worth taking a second to remember that even in its slowest form, the python code is adding up 100 million numbers in 20 seconds. It's slow, but "slow" is a relative term.
On any CPU capable of many teraflops, 20 seconds for 100 million numbers means eons. It's like walking and taking a step only every 5 seconds, short distances It's not that bad, but will badly scale once you need to travel farthest.
"Is there anything o beat numpy?" Try compiling numpy top native machine code using numba: @nb.njit(cache=True, parallel=True) def numba_sum(): return np.sum(np.arange(100_000_000)) > 0.00064 seconds Or suing your GPU (Cupy): def cupy_sum(): return cp.sum(cp.arange(100_000_000)) > 0.001747 seconds Cupy should be an order of magnitude faster than you CPU, but only if you do some more complex computation than calculating the sum of the array.
for people more interested in the formulae, it's for sum of n terms of an AP the formula is: (n/2)[2a+(n-1)d] where, a = starting term d = common difference n = till nth term
@@mCoding I think you should be more clear regarding the difference between language and implementation. It was definitely an interesting video, but this is not a problem inherent to the language itself, but rather a problem of (probably?) CPython. Also, C doesn’t optimize anything, gcc does. At least if you don’t tell it otherwise.
I personally like to start with python. Its vast libraries and simple structure is easy for me to test programs. Later when I decide on a specific implementation, i'll code that in C++.
In reality, python is the first thing most people will use because it is easy to use. Python is only usefull because if works as a glue for other languages. Easy access to libraries written in more difficult languages.
Actually any modern C or C++ compiler with optimization turned on (-O3 or similar) would by itself compute the value of the sum compile-time and just inline the result. I am shocked that Python interpreter does not even try to for example throw out unused instruction during compilation. There must be a way to enable this. Or maybe this instructions actually have a side effect because of its weird namespace policy and cannot be thrown away without potentially spoiling the code. I am not sure...
Both C and C++ compilers such as GCC and LLVM are written in C++. Python is written in C. JavaScript is main Python's competitor in easy to use scripting language... And it's V8 is written in C++ and its JIT is incredibly fast, the only thing that can compare is LuaJIT which is practically abandonware because one dev who made it has stopped caring and noone else is intelligent enough to pick it up and maintain it properly. Still, the difference is that LuaJIT is very fast... But it can't stream code and execute it in my browser as fast as possible, most projects don't even use lua for one reason or another. Just shows that C is obsolete language and all good programmers use C++. There of course will still be C zealots that will pretend otherwise, but they can't even write their own compiler and have to depend on C++ programmers who they hate more than they hate C++ for some reason.
Yes I strongly agree: 1.) optimized algorithm over loops, 2.) optimized (not runtime interpreted) loop-code, 3.) if none of the former is possible slow interpreted loop directly in Python. Where (3) is ok for best flexibility expirience when learning or developing & debugging inside new algorithms. I've been a staff software engineer (development & quality assurance) delievering a (Basic) scriptable technical measurement data PC-application NI DIAdem, concentrating much effort into giving app-poewer users as script-programmers (engineers & physicists) optimized script commands, which performance expirience behave similar to that of Python. So using already implemented looping command may give you better performance than own loops as described here.
that is why you don't do intensive processing in Python itself, but rather use packages that implement the functionality you desire directly on C that said, the best way to loop in Python should always be to use the simplest construct that fits your use case, since Python gets more value from readability then from scavenging microseconds of processing it's also always good to import this
It's not really a fastest way to do it, it's always faster to use a math formula to compute a thing, if a formula exists. The purpose of using Python (or any other language) is not to replace math formulas but implement an algorithm to solve problems in an easy way (whatever easy means). Another way to be faster, is to implement the operations you need in assembly, but doing so you miss the point. Math is faster than assembly, assembly is faster that C, C is faster than Python and Python is faster than bash. But they are different languages and need different skills. The correct answer is: Prefer "for" over "while" to loop in Python because it's faster.
Did you run these tests with Cython? Probably bringing everything to C would have interesting results. The async model on the for iterator for example should not be faster than a while loop unless the async code were optimized out.
The point about math is incredibly true: I optimized my VarInt bytesize computation algorithm from iteration to bit operations and lookups, and it now runs (at 2^32 loops) in ~2s vs ~60s
while loop 6.429943797 for pure 4.258419519 sum range 1.4508647020000005 numpy sum 0.2784187069999984 math sum 1.868000000015968e-06 iMac i7-10700K, Python 3.9.2, macOS Big Sur
while loop 5.056872800000 for pure 3.253743700000 sum range 2.379359199999 numpy sum 0.138200499999 math sum 6.299999999015e-06 Desktop PC with AMD R5 3600, Python 3.9.2
Strange that my math sum is so slower than the Intel iMac while my other operations are faster. If I was to guess, there's something in Windows that adds a little delay before or after the operation runs. Might be that it takes a bit longer to call the function, but short enough that it is drowned out in the longer operations where my CPU runs faster.
@@jensdanbolt6953 while loop 4.5906675 for pure 2.9198796 sum range 1.9685457999999993 numpy sum 0.10591909999999949 math sum 2.799999998970293e-06 the same iMac, i7-10700K, Python 3.9.2, Win10Pro (Boot Camp)
@@atvarsgr While we are at it here is the contrast to a c implementation: python with for loop: 8.74s c with gcc and no compiler optimizations: 0.235734s c with gcc and full compiler optimizations: 0.048770s (gcc doesn't detect the pattern and actually executes the entire code) c with clang and full compiler optimizations: 0.000001s
Let's also add to this that it's not just the speed that matters. Choose the right abstractions. If your language can only say "while" and "for", that's a lot to write and a lot to comprehend to work out the meaning from low-level statements like "increment" and "compare". The code is much clearer if it literally says "sum a range". The more iteration abstractions you have, the clearer the intent of the code.
Thanks for that! I had no idea about the ratios. Two ideas though: I'm doing statistics and ETL jobs with pandas. I would've been interested in other methods as well like recursion or doing the same with pandas instead of numpy (whether it's any slower than pure numpy), pandas + apply + lambda, etc. Second, going one step further and optimising performance for if-elif-else / mapping function would be great for another video.
The conclusion: "maybe you didn't need that loop in the first place". I like it. Indeed loops are the piece of code that is really worth to think through. Maybe part of computations is the same in each iteration, then calculate it before the loop, then in the loop just use the result. If you can speed up a critical loop at the expense of code readability, do it, but add comments why and how it's done.
Great video! Really informative. I'm surprised by how much faster numpy sum is over the built-in sum. As an aside, the fastest solution would be to use the sum(1...n) = n(n+1)/2 formula. The fastest loop is no loop at all :)
I got these results: while loop: 3.0520888 for loop: 2.9422460999999998 By removing the unnecessary i variable in the while loop: def while_loop(n=100_000_000): s = 0 while s < n: s += 1 return s That i has to be type checked by the interpreter in every loop. You pointed out the range factor yourself, which also contributes immensely to something not essential to the speed of the loops.
I just beat numpy with a standard C++ program using a register ... while loop took 2.5 seconds to complete, numpy took 123ms approx. My solution took 56ms, compiled with g++, 65ms with gcc. Without register flag on the int, time jumps up to 66ms and 72ms respectively. int main(){ register int rbx=0; for(int rax; rax
You know, if the time difference between a while loop and a for loop matters to your application, Python might not be the most appropriate language.
Comparisons are expensive and iterating is cheap. That's true in assembly, C and Python.
WRONG -- he just showed that numpy works fine
@@firstname4337 Yes, because if you use numpy you use C. That is the whole point of numpy, to not use python for operations that are repeated millions of times.
Alex' point stands, because in the numpy case we use Python for everything that is not time sensitive, but for the thing that is time sensitive we use a small C monster wrapped in a thin fluffy Python blanket.
@@HiltownJoe Have you heard of PyPy. I ran his source file in pypy3 and got less than a quarter of a second for all methods except sum generator and sum list comp (0.72 seconds for those)
Clearly not a protip. A professional understands the constraints of their environment and the project they're working on. The choice of programming language is not (always) up to the individual writing the application.
"The Fastest way to Loop in Python is not to Loop in Python" Great sentence.
Talk about understanding the letters, but not the meaning.
To loop,or no to loop that is the question
@@samuelhulme8347 Hahaha DEEP
@@samuelhulme8347 ok, nice job, now im lost ..lool
"a strange language. the only winning way is not to use it.
how about a nice game of chess?"
pro tip how to boost python: use C
or C++, either choice is good.
C++ should be the better option here, cause of the OOP support.
Or any compiled language, really.
Assembler the only real choice.
@@Jonas_Meyer You're not a real programmer until you can program in assembly
I'm a physicist and my experience is that when you spend the majority of your time writing the code, and running it takes only a few seconds, reducing the development time is much more important than reducing the run-time, and that's why I like Python.
So true, and something many eager-optimizers need to hear! They will certainly learn from experience as they discover where they spend their time.
Hi Peter, well I've been a software developer delievering a (Basic) scriptable technical measurement data PC-application NI DIAdem, and we put most of our development effort into giving our users like you (engineers & physicists) optimized script commands, which behave similar to that of Python. So using already implemented loop command may give you better performance than own loops as described here.
that's the thing tho, they're not mutually exclusive, you can have a very easy to write and a decently performant language (look at julia)
why python is slow is not bc it can't go faster or bc it's a sacrifice for expressiveness (as there are python implementations that are way way faster but that lack total compatibility with libs)
The reason why python is slow is because it's poorly designed. It uses linked lists for lists instead of vectors (log of n vs log of 1), it compiles to an intermediate language but it leaves it at that, there's no JIT in Cpython, the global interpreter lock impedes people from writing super performant code, etc, etc
Python is slow bc of the choices they've made over the years, they're consistently made choices that affect performance, they never took the time to think about it and see how they could implement abstractions that would not take a big toll on performance, they just added things on top as they saw fit.
But that's not a big surprise once you see that python was a pet project of Guido and he just made it bc he could and people happened to pick it up and spread it (even tho it's performance and features were worse back then)
You're using Python for simpler scripts, like what it was made for. That's a valid use case. Making large systems in Python is just stupid for the simple reason that it will be expensive to run.
That's depends on task, but when we talking about science - it's 100% true.
I'm in my 3rd year in Computer Science and trust me, I've watched SO many youtubers that do this and you're by far the smallest for an unknown reason. You explain things really well, you don't beat around the bush and overall just make quality content. :D I'm now subbed and I'll be coming back for more :D
Thanks so much for the kind words!
Excellent content and testing summaries !
Then again, he's the only one that would place "just know your result ahead of time lul" as a viable suggestion.
What I did not know about this video, is, 100_000_000.
I did not know if python has a feature to add underscore between number.
Thanks, anyway :)
I add secret lessons to all my videos!
Added in Python 3.6.
Same! I didn't know this too!
Same
Same
Fun fact: A good C compiler would detect this common pattern (summing sequential numbers) and replace the for loop with the direct calculation of the result. And if even the number of elements was hard-coded, the compiler would even simply replace the whole code with the hard-coded numeric answer.
I always wondered what's the point of this? Are series we have summation formulas for really coming up in a real software? As far as I can tell the only application is to show it in a presentation about optimizing compilers. And frankly, this is even a wrong showcase of optimization, cause the useful techniques are general and pliable enough for compiler to reason about. This is just a hardcoded nonsense.
@@Czeckie this exact problem isn't common. But good compilers can recognize loads of (small) patterns in a codebase that, when combined, can amount to a large speedup. Sometimes some optimized chunks can again be combined into a new, smaller and/or faster, chunk.
@@oj0024 Yes, if you write shitty enough code the compiler cannot fix it for you because it can’t make assumptions about what you’re trying to accomplish. This is not a revelation or even a shortcoming, its a feature.
@Matt Murphy He is right that compilers can do it, but if we look at the two most modern compilers right now only one of them can detect it and only if you write the code in a non-obvious way.
The second part showed that compilers don't do it with constants either, they just execute the code internally. I'm not saying that compilers aren't doing great optimizations, just that the claim that "A good C compiler would detect this common pattern" is wrong.
Especially since such optimizations don't gain anything (I'm talking about the pattern transformation, not the constant evaluation). They can only be done in very few cases, and you usually assume that the programmers know what they are doing. (e.g. the only reason to implement the summation code would probably be for benchmarking purposes)
@@joestevenson5568 clang didn't detect the pattern when using a normal for loop, it only detected it if you write the non obvious and usually not used while(n--) loop.
5:08 It's not about NumPy being primarily written in C (so is CPython), but about Python' dynamic vs. NumPy's static typing. In a dynamically typed language like Python, every time the interpreter encounters an expression like c = a + b, it will first have to check a's type (and find out it's an integer). Then it checks b's type (and finds out it's an integer too). Then it checks if there's any known operation to add (+) two integer types (of course there is). If you do this one million times in a row (e.g. in a loop), the same checks will be performed over and over again. It's like asking "Is a still an integer? ... What about b? ... Do I know how to add two integers?" one million times in a row. And of course all of that happens at runtime. That's why Python is so damn slow in that regard.
Yeah... that's the price you pay for flexibility. But there are ways to overcome this. They all require some kind of hinting that the structures you use contain only elements of the same type.
> In a dynamically typed language like Python, every time the interpreter encounters an expression like c = a + b, it will first have to check a's type (and find out it's an integer). Then it checks b's type (and finds out it's an integer too). Then it checks if there's any known operation to add (+) two integer types (of course there is).
A dynamically typed language implementation doesn't *have* to do things this way, that's just the naïve approach (albeit the easiest approach out of the box for new implementations). A smarter specializing JIT can create precompiled versions of functions and expressions for commonly used types (e.g. integers), and can determine that they are invariant and avoid having to re-check the types each time you hit the expression.
The default CPython implementation does things the way you describe, but alternate implementations like psyco and PyPy take the specializing approach in their JITs.
Not surprisingly, PyPy executes these for and while loops on the order of 50-100 times faster than CPython does on my machine (and ~10 times faster than numpy).
$ python3 benchmark.py
while: 5.254076757992152
for: 2.9640753349813167
sum_range: 1.1016407490242273
sum_numpy: 0.6697274879843462
$ pypy3 benchmark.py
while: 0.05818561598425731
for: 0.07444041498820297
sum_range: 0.07667819698690437
I hope this problem can be partly solved if people declare data types of nearly all the variables as they need to do in C before hand like I do most of the time
@@sumnerhayes3411 There are tons of other reasons why dynamic typing is pointless and bad. It saves you having to *gasp* tell the program the type of a variable one single time; and it costs you dearly. All python projects should enforce static typing, even though that doesn't gain performance. All languages should be statically typed, period, because it is infinitely more readable and provides no development benefits at all.
@@kylehart8829 I would have disagreed prior to learning Python, back when it was just dogma in my head. Interesting how that works.
However, the learning process was made easier by removing a step. I think a better solution would be to make it extremely clear how to toggle Python into enforcing static typing, and why that's a huge benefit.
Sometimes it’s just impossible to express something in numpy - iterative algorithms, for instance. I had a very hot piece of code, which implemented an A* pathfinding algorithm, which was slowing my entire program down.
I looked for solutions and found a package called Numba - it allows you to decorate a single function and it would try to compile it the best it can. There are various optimization settings to consider, and in order to get *significant* improvements in performance (like, order of magnitude and more) you basically had to rewrite your program as if you were writing in C, but using Python’s syntax - so, no dynamic typing, using very limited subset of built-in types and refrain from using custom classes as much and possible.
But it didn’t require too much work, and it was a great benefit in the end - I ended up with a blazing fast function without needing to write it in actual C (not very hard) and then figure out how to couple this with Python (quite finicky if you’ve never done it). So yeah, I’d recommend looking at Numba if you need a more general solution.
Or just ditch python and go for java or other jvm languages or go or rust etc
@@NJ-wb1cz Eh, rust is likely one of the hardest languages to learn (after "modern" C++ with template metaprogramming). Julia seems quite easy to learn and pretty fast (if strongly typed it's close to C, maybe twice or three times slower, but not two orders of magnitude like Python). On the other hand I'm not sure how to structure code which is too complex for a single file and too simple to put it into package. It was basically developed for data crunching.
Also year or two ago Rust was lacking some libraries, especially for plotting graphs (C++ lacks them too), painting (Cairo) and even exporting data as images (no support for writting of 16bit PNG files)
@@NJ-wb1cz right, who would do something as stupid as just adding another library as dependency when they could add an entirely different development stack just to optimise one function.
@@julians.2597 why is adding libraries stupid? That's a strange stance. Sure, moving to another language is an option, but you'll likely use libraries there as well.
Before ditching Python - try those two things:
* PyPy
* Nuitka
Both are still Python and both don't require you to change any code ;-)
Very good video. Note: NumPy isn't just C, it's highly optimised C with SIMD capabilities. The only way you can go faster than that is with GPGPU like OpenCL, CUDA, Vulkan or Metal. (And even that isn't assured since GPGPU has initial overhead.)
Half through the video I thought: actually for this simple sum you don't even need to loop just the formula. And you didn't disappoint.
While this might seem like a very synthetic example which won't often be possible in real life, you'd be surprised how many times I (a programmer who is not even very good at maths) have been able to replace large loops and delete large chunks of code and replace it with a simple formula.
Not usually this clean and simple, but quite often you find that the code is looping through a lot of data calculating stuff and then filtering out stuff, and if you simply rearrange things you might be able to do the filtering before the loop which might even reduce the loop to a few statements without a loop. Very often this is not obvious in the code but very obvious when sketching out the desired results on paper.
Pencil and paper is definitely not obsolete.
Indeed! Never underestimate math! The feeling of deleting 30 lines of scratch and replacing it with 3 lines of math is indescribable.
@@mCoding it's the moments when you're lying in bed and it kinda hits you out of nowhere that are the best
Of course, it is kinda a let down when the divine revelations are actually wrong though
Doing it on paper with a pencil probably triggers a different part of your thinking process from all the years of doing math that way.
You can actually be even faster than with math, if you know how CPUs work. If you instead of calculating (n * (n -1)) // 2 just calculate (n * (n -1)) >> 1 it should (in theory) be a few clockcycles faster. While most math operations are pretty trivial for the CPU(+, -, *) division is a pretty annoying task, so you might spare some time with just bitshifting.
@@delqyrus2619 Bit shifting is math too. No different than multiplying and dividing by powers of ten being easy in decimal. A good compiler will optimize it though, so it's probably best not to bother.
The UA-cam algorithm got you mate. All the best!
Thanks for the support!
You want to be careful when timing functions, your processor likely speeds up or down based on system load and ambient temperature changes. Although you can still capture order of magnitude changes very easily like you did. If you're on a laptop though where cooling and power are limited, take extra care, and run multiple simulations in different orders at random times!
I cannot stress this enough! Benchmarking is such a difficult topic even though it seems so simple.
One thing you can do is to disable CPU boost clocks during benchbarks. This way, you get stable clocks regardless of load created by the tested function. For example on Linux, you can just write 0 to /sys/devices/system/cpu/cpufreq/boost to disable the boost or write 1 to enable it again. This’ll get you much more stable results that are comparable even if ambient temperature changes.
@@luziferius3687 that's a good approach! Myself I usually try to do perf stat whenever possible, and count the actual CPU cycles it takes. Run each command say 3 or 5 times and take the median!
@@mCoding Obviously the best way is to time it yourself with a stopwatch
imo it would have been better to use a smaller n in the function and a higher number of runs
If you're using a counter within the while loop, then yes that counter is slower than the for loop. If you don't, and you use the while loop as intended and doing the same stuff inside the loop, then it is essentially the same speed.
def while_loop(n=100_000_000):
s = 0
while s < n:
s += 1
return s
takes about the same time as:
def for_loop(n=100_000_000):
s = 0
for i in range(n):
s += 1
return s
Use the right loop for the right task. If you have to put an incremental counter in your while loop then you're using the wrong loop.
Comparing the sum with the counter limit certainly won't give the correct result in s: We are calculating the sum of all integers and analyzing various ways to do that. I. e., if the limit where 10, with 1,2,3,4 we reach it, way before reaching the correct sum 55.
This was a valuable lesson to learn when I was coding a neural network. I wasn't aware of this fact and was trying to parallelise, before realising numpy automatically did that. Turning everything into vector operations led to a speed increase that was crazy.
This is the reason why python is such a good programing language. You can write your program with easy and pretty syntax and use packages wrapping C code to do the complex and time consuming tasks. This just shows that you can use python and have fast running code.
I would also add that if your program's only focus is speed and performance then don't use python, simple as that.
People too often leave out the speed of development in their calculation of speed!
Python isn't special in this regard
People oftentimes forget lua.
A faster processor is cheaper than another developer to speed up development. That’s why I use python.
@@HydratedBeans unless you depend on oython-specific libraries I don't think the development speed is noticably greater compared to, say, kotlin. And a loosely typed language will always have more mistakes and harder to statically check
To the point, quality content. Subscribed!
Thanks for the support!
"The best way to loop in python is to not loop", hahaha, loved it.
Also got curious about doing the numpy sum but with the built-in range (because of the memory problem), how much would that impact the performance?
In my experience, I've never hit a RAM limit because of this, but in memory constrained systems this is definitely something to watch out for. I wish numpy has something like std::ranges::iota to use as an array-like.
@@mCoding There is a library called dask, is used in distributed computing and has support for numpy arrays. working with blocks consume less memory but might be a bit slower.
i think numpy.arange(...) creates a potential memory issue (it allocates the a large array in this case), but the Python3 builtin range(...) doesn't allocate a large array - it is a lazy iterator that simply tracks the last value generated so it knows what to generate next.
Fun fact: If you write this in C, gcc/clang will automatically optimize the for-loop sum into n(n-1)/2 for you.
really?
There's a fun story about that summation formula. Apparently mathematician Carl Friedrich Gauss came up with it at a young age when his teacher set his class the task of adding up all the numbers from 1 to 100 and was very surprised by Gauss' solution. Regardless of the veracity of the story, the formula is indeed often referred to as the Gauss summation formula. Mind you, a lot of things in math are named after Gauss as he was extremely prolific.
You left out one important aspect; the teacher wanted to keep the kids busy for a while and Gauss came up with the solution within minutes.
Another important aspect, the formula is actually (n+1)*n/2.
@@thomas-sk3cr Exactly. I was looking for this correction.
@@thomas-sk3cr which actually illustrates a pitfall of using external math formulas instead of letting the computer do the work, that they are going to be less clear about what the code is supposed to do. getting the wrong answer really fast is worse than getting the right answer slowly
@@adrianmizen5070 not sure if I understand you correctly. I do think that it's definitely better to use a formula if there is one. But it should be properly documented of course and you have to make sure that it's the right one. Maybe we can agree on that? I don't think you should reimplement well-known formulas with loops just because it's more clear to the reader at first glance :)
Oh this is an amazing recommendation by youtube. I came to the same conclusion as you while having massive performance issues in my machine learning models in non obvious parts of the code. Once I noticed that it is the loops that are awfully slow I tried refactoring them as much as possible and ended up using a mixture of things u've shown here. I didn't even know or speculate that it's because of C but now everything is much clearer in my head. Thanks!
You should also check out Jax, Numba, and Cython for potentially enormous speedup for little effort if you are doing ML.
you're not supposed to do ML with for loops in theory you can and it's very undestandable if you do it that way but any solution using matrices/tensorflow will be faster. Like he says in the video numpy is fast
Oh snap! I thought I was saving on resources by not using additional modules like numpy but now I see it's ridiculously faster. I wasn't aware pre-built functions were that much better either. Thanks for the reveal!
He actually did say in the video that Numpy uses a lot more resources, and with a bigger number, you may not have enough memory. Your original thought was correct, that you do save resources by not using additional modules. You may save time, but not resources. Also, if you compile your code, those additional modules will need to be compiled in, which will make the executable larger, and require more resources as well. As we are quickly approaching the limits of how far we can go with hardware (CPU technology is pretty much capped out, unless someone figures out how to make atoms smaller), we need to be more mindful of the amount of resources we use. Programmers need to stop being lazy. They have to stop thinking that they can bloat their code with hundreds of modules because CPU's will be faster by the time their code is in use.
Just because speed and bandwidth are available, doesn't mean they should be used. For example, I know of some websites that include 50-100MB of Javascript code for each page view. Sure, dial-up is a thing of the past in most parts of the world, and people have faster internet connections than ever before. The page loads pretty fast on a gigabit connection. What about the people who have much slower connections? Is it really necessary to design your website where every single page view uses an entire day's worth of bandwidth on a dial-up connection? If you have several pages of the same website open in different browser tabs, your memory utilization goes through the roof! I've seen a single browser tab on one of those websites using over 2.5GB of RAM! I haven't even checked how much unnecessary Javascript is downloaded when using something like Facebook, but I've seen a single tab with Facebook open using over _5GB RAM_!!!
This kind of waste needs to stop. Even if it does save a second or two here and there, it could cost several more seconds (or longer) in other areas. I've always hated importing modules in any language I've ever programmed in. I would much rather write my own code to do what I need done, whenever possible. I realize that's not always possible, but it's always better than adding bloat to a program that's not needed. Look at what's known as the "demo scene". It's something that started LONG ago, where people try to make the most exotic program possible in assembly language that fits within a 64KB executable. People have fit 10 minute long full screen videos with music into 64KB! They have created complete first-person-shooter games in 64KB! These programs run plenty fast on hardware from over 30 years ago! If more of todays programmers took the time to write good code, we wouldn't even need computers as fast as we have.
Sorry, that started off as a 1 line reply, and ended up going 100 directions. There's still soo much more I want to say about the topic, but who is going to read it anyway?
@@eric_d I will! I'm very new to programming and I really appreciated reading that. Thank you for the reminder :)
Awesome content, got very surprised when I saw your low sub count
I appreciate that!
me too
Thanks for this, I was trying to explain to people in JS that essentially always calling a native function is faster than writing a more optimized(in terms of Big-O) way of doing something. They were asking Big-O proof and didn't believe me that it's just not relevant. Looping through a string 10 times with a native regex check is faster than iterating through it once in pure JS.
It was the one addition statement executed many times that slowed the python "for" loop. The conclusion is that looping can be very fast but executing python statements anywhere in the code is very slow.
I think your channel deserves more subscribers.
I appreciate that!
Another reason the numpy approach is faster: The elements of a numpy array must all be of the same type, and there's no overhead to check (and convert to numeric, if necessary) the type of each array element.This probably contributes much more to the increase in speed than the language difference (C vs. Python.)
Other than that, the sum itself can be computed parallelly and fused together later, thus we are not forced to iterate serially element by element!
In compiled languages loops are often optimized on compilation stage, such as being unrolled where necessary, with particular emphasis on efficient cache/prefetch from memory utilization. One should look on original Fortran code of LAPACK routines where some loop optimizations were done 'by hand'
@@dmitripogosian5084 loop unrolling is never efficient when cache and prefetch is the topic unless you explicitly wrote "from 0 to 4" just to avoid copy pasting code 4 times by yourself, and even that may not be worth unrolling depending on body of the for loop which most compilers can decide by themselves if you do profile guided optimization pass. There's a reason why -O2 more performant for most software than -O3 and unrolling is one culprit of that.
“Python is an incredibly slow language”
You just insulted my entire race of people...
but yes.
I guess I’ll be first then...
Your guess right
Its not incredebily slow, compared to others its slower yes
but it can acomplish the tasks its meant for with no issues
Machine learning, Automation, Data Science and Cyber Security
@@soupnoodles Yes but you can do all of that much faster with C, you're better off using C.
@@soupnoodles Also why did you like your own comment?
Great comparative video
The math solution if you include n in the addition is ((n+1)*n)/2
Because you used range() and the while loop was set to < and not
Flooring the division is very much recommended, because otherwise python will convert the result to a floating point number which may cause you to lose precision
@kurogami2771 it does in this case as we are adding up a series of numbers from 0 to n, which by definition excludes negative values.
As for flooring, it is true that we should avoid using floats when not needed even for just one operation so we can either floor or int() the results.
@@aioia3885 Either n or n-1 is even so it's never a float. Using floor isn't required.
@@abebuckingham8198
edit: "either n or n-1 is even so it's never a float" is not correct because in python division with / will always result in float conversion (see type(1/1)). the rest of this reply is just trying to clarify what I now realize was not necessary to clarify
in my comment I said "flooring the division is very much recommended" but what I really meant was not using the floor operator on the result but instead using the integer division operator. in python if you do a/b the result will always be a float bu if you do a//b then the result will be an integer if both a and b are as well. doing a//b is basically like doing int(a/b) except that there will be no conversion to a float and so no precision will be lost. for example, int((10**18+2)/2) evaluates to 5e17, because the division was a floating point one, but (10**18+2)//2 does result in the expected value of 5*10**17+1
This channel is gonna blow up great work
That's why Gauss exists : simple use of (N-1)N/2 in this case
And he came up with this at such a young age too!
wait isn't it: (N+1)*N/2?
@@bFix He takes sum to N-1 in the example
@@LeDiop Yes, so it is n*(n-1)/2
@@Samsam-kl2lk yup
Can you do a comparison of recursion over loop?
I have many video ideas in the works, this is definitely a contender.
Usually recursion is slower and takes up more memory. And many "big" problems cannot be coded recursively due to stack size limitations. But if compiler/interpreter can do tail recusion optimization, it's a very nice option. Unfortunately, not all problems are applicable to it.
The CPython interpreter doesn't do tail recursion so even if you write it out correctly it's still faster to loop, save the recursion for when you forget how to convert your code into a loop correctly
@Sec Coder Check out Lisp or Prolog. The recursion is conceptual in terms of the problem or model.
Imagine a family tree. The definitions and relationships are recursive. Mom and Dad produce offspring each of which in turn can become Mom's or Dads who produce offspring.
Comes into its own when these Relationships are highly complex ( Imagine every leaf on a real tree was in fact another tree and so on and so forth. The Mantlebrot set is recursive )
Essentially they become huge lists or recursive data structures.
Those languages became prioritized and optimised around lists and recursive data structures.( With a few constructs and language primitives not found in the likes of C or Pascal such as "is a type of "
"is connected to" "is related to", "is derived from" etc etc )
Pythons modelling capabilities Tuples Lists etc etc Means most of it can now at least be modelled easily before being translated to something else for speed.
Ps for fun check out Occam . It's editor was indent sensitive just like Python
It also let you mix sequential and parallel programming .!!!! Oh lordy 😁
@Sec Coder It lets you express common algorithms in a purely functional way, i.e. without needing to mutate variables. Doing so makes your code easier to analyze mathematically and makes it more composable.
Great video. I’ve always found list comprehensions to be faster than the equivalent for loop or while loop.
There is a language called Lython, so you don't try to loop in Python; instead you poop in Lython.
Thanks for the tips!
The comparison with NumPy is not fair because it is made for array computing and hold an unique data type.
By the way, you deserve far more subscribers.
Thanks!
I really like your channel! These are very interesting insights into the Python language in small videos which makes them easy to digest. Keep up the good work!
Awesome, thank you!
Rust with 64-bit numbers looks to be about 50× faster than numpy on my box (once I convinced the compiler to stop solving the problem at compile-time without losing optimization). Rust with BigUint bignums looks to be about 5x faster than numpy. The obvious for loop with pypy3 is about 12x faster than numpy. So there's still plenty of room for speedup.
Expertly presented. I never would have thought to look at it where C was used behind the scenes. I will definitely try and make the better to worst progression habit when writing Python.
Half way through the video I thought to my self why even loop to find an answer to a simple arithmetic progression and then I was surprised that this was addressed as well. Wow, impressive.
Thanks!
It serves the purpose to explain the problem of loops and also the process of finding different solutions to a problem
It’s a nice exemple
you are the only programmer that i pressed liked, subscribed and the bell icon :)
I’m a Python novice and I found this to be really interesting and useful. Many thanks!
Just wanted to note that using an iteration index in a while loop isn't really good practice anyway, since that alone would almost always mean a for loop is the correct pattern. So it's not really surprising that the while loop was slower in this experiment.
I'd also like to warn people that numpy is only faster if you are operating on large sets or large arrays of numbers. If you apply a numpy function to a single number, it's almost always much slower than the equivalent function in Python, presumably due to some overhead.
That isn't the point. The point is the efficiency of raw Python.
Good example of how being good at math and algorithms beats all micro optimizations of the code itself :)
Exactly!
My own results were 6.06 seconds for the while loop, 4.25 seconds for the for loop, and 569 ms for the numpy loop.
I tried it in Julia (almost identical code except for the end tags), and the results were 1.3 ns for both the for and the while loop. Checking the lowered code it turned out it had optimised the loop away completely, so we got sum_math for free! So instead to force it to do the actual sum, I modified it to s += i * (rand() < 0.5), and it came in at 576 ms. So even with a call to rand() and a branch, Julia was still nearly 10x faster than raw Python, and was about on par with NumPy (which didn't have a call to rand() ).
If I force Julia to construct the range object with collect(1:n), then I don't need so that it matches NumPy, then it gets even faster at 290 ms, but now it allocates 763 MiB of memory, so clearly there are pros and cons of using range objects, although I'd stick with the range object for most usages.
So, if you like Python but you want C performance (and you like to use arrays a lot and don't like to add all the extra numpy syntax just to do that), maybe check out Julia.
Most like myself use Python for the libraries, nothing else comes close except for maybe C++. It would take a lifetime to recreate all the libraries in Julia.
@@incremental_failure there has a lot of ongoing work to make native Julia libraries (mostly in the scientific and statistical communities), but Julia can also call C, C++, and Fortran code, and the PyCall package already allows you to embed Python in Julia, so you're never left in the lurch if you choose to use Julia.
I've already seen Julia be hugely faster and more flexible when doing a lot of ODE modelling than when using R and calling deSolve which is written in Fortran.
This channel is a gold mine, thank you so much for these. Keep them up!
Thanks, will do!
In my one of my programming classes we learned about loop unrolling. This is much more handy to know about. :)
Your channel is great! I suggest you use a high pass filter for your voice to cut off the low frequencies that make your voice sound muddy, great job so far.
Thanks for the tip! I have no idea what I'm doing video editing so tips like this can really help improve my video quality!
It would be a good idea to show examples that does a bit more work within the loop, to show that if your loop body is costly then how you loop is largely insignificant.
Yeap, I was thinking the same thing. These tests really don't tell you anything meaningful.
when you dont know how to code but you know english: "i know some of these words"
I like this video very much. Subscribed.
Also, interestingly enough, I am taking a class about functional approach to engineering problems and the professor repeats every class to avoid looping like you would avoid the Black Plague. Lol
what's the name of that course?
@@braulioacosta6816 functional approach to engineering problems lol
I was so much expecting this formula to come up at the end :p
This is very informative and well presented. Also, it seems like you are very talented in explaining stuff, but this is only the first video of yours I have watched, so I'll have to do some further benchmarking ;)
Thank you!
Very informative video, keep at it and im sure you'll grow a ton.
I appreciate that!
Cool video. I’d be curious how `numba` stacks up in this, both the JIT and AOT variants
Numba would make it closer to the unoptimized C/C++ code. I used it to render fractals and I got it to run 50x faster than unoptimized Python, and almost the same speed as the C code. The thing is ... that you can make the loop a lot faster in C too because you can use SIMD intrinsics, so you could do 4 or 8 or even more operations in a single instruction.
Just tested.. The JIT execution time is similar to the math version. To all Numba / LLVM developers: Great job!
You just convinced me to learn algorithms and functions. This was an amazing illustration.
Loved the video, certainly gives me some ideas for my own code optimisation.
However, I think your maths is off? In the arithmetic summation formula Sn=n/2(2a(n-1)d) being simplified where a=n, you've forgotten that the common difference will be -1 instead of +1 as we're counting down, not up. This should simplify to (n*(n+1))/2 instead of (n*(n-1))/2? Tested with the numbers 1 through 10 yields 45 on your formula, 55 for the correct answer.
I think numpy also will generally, if it can, use parallel CPU operations (SIMD) to process the array, which is probably why it's faster than standard sum() (and also why it makes sense to create the array of data with arange() even if each value could be calculated in the loop. But definitely good to be aware of this memory requirement.)
The numba package may be useful in some cases.
Absolutely, if you can use Numba to @jit your problem, them it will very likely speed up the solution! I'll probably cover this in a future video.
@@mCoding I did a small test.
from numba import njit
@njit
def int_sum(n=1_000_000_000):
s: int = 0
for i in range(n):
s += i
return s
timeit.timeit(int_sum, number=1)
5.373000021791086e-06
timeit.timeit(int_sum_p, number=1)
44.33714248199999
int_sum_p is the same function, without decorator. The result is very close to C speed. But obviously, the best improvement comes from doing the maths properly.
Numba is amazing, if you can limit the code in the @jit’ed routine to simple loops and math and/or numpy operations. You can’t use things like lists or dicts
Should do all the examples with numba this is also interesting. Cython also. Masterclass in squeezing loop performance, could also show sum(x for x in range(n)). Ideally use matplotlib to show the data for different sum sizes and prove the obvious that it is linear
@@gregorymorse8423 When trying to make a comparison with compiled C++ code, I discovered that clang removes the loop altogether, applying the maths that the programmer should have done in the first place. And I suspect that numba applies the same optimisation, otherwise the total time does not make sense with the clockspeed I have.
Great video. Commenting mostly to help the algorithm push this. Would have loved to see a comparison with straight C thrown in there given that C code was the source of the speedup. What's the cost of the python to C transitions?
A C compiler would probably compute the answer at compile time so there is nothing to compare! A Python to C transition would be a huge undertaking even for a small project.
Here are the results on my pc:
python with for loop: 8.74s
c with gcc and no compiler optimizations: 0.235734s
c with gcc and full compiler optimizations: 0.048770s (gcc doesn't detect the pattern and actually executes the entire code)
c with clang and full compiler optimizations: 0.000001s (clang detects the pattern and optimizes it using the Gaussian formula)
@@oj0024 wow, there is absolutely nothing to compare. I knew python was slow and C was fast, but I didn't know that the difference is to this extent
@@mCoding Nuitka will take Python code and transpile to C (which uses the Python standard Library to do the object iterations) - on some applications it is 2x faster than normal Python (and it is a work in progress).
This is probably the first coding channel that I looked up just for fun. Great work. Can you cover advance pythonic syntaxes and topics?
I'm always open for suggestion! What do you want to see?
@@mCoding Nothing in particular just any fancy pythonic things which help improve performance and code readability.
Also, one thing you can do is make a mess of a code and start refactoring it. That would be amazing to see.
Wow .. I learned about 5 new things from this.. using timeit is a great idea.. thank you!
If you are writing pure and iterative heavy Python code, then PyPy is your best bet for performance. I ran the same test as mentioned in the video on Python 3.10 and PyPy 3.9 and my god the difference was staggering! Here are my results :
Python 3.10:
for loop : 6.6 secs
while loop : 11 secs
PyPy3.9:
for loop : 0.22 secs
while loop : 0.36 secs
PyPy is really meant for long running code and real world results show that.
There is another option. Often you can use the numba library to get a speed boost over numpy.
Great video! Another idea is to time yield in functions and generator expressions in comparison to the traditional loops
Great suggestion!
yes, graph the size of n against the time it takes different strategies to compute the sum >:3
make the computer dance for my entertainment!
Dayumn... TIL something new as well as some new trick. You definitely earned my sub!
I like how he showed us a clever way to implement loops and NOT how to think outside the loop and just find the answer with math
Awesome video! A few days ago I was comparing the execution time on using a loop vs. the Gaussian formula, it's nice to see some other options in between.
Thanks for the support!
@@mCoding Very interesting video indeed, something I never really thought about … but the math-guy in me needs to say it: You got the formula wrong… 😬
Very informative. Excellent insight into Py inner workings. Great incentive to employ Numpy (and Multiprocessing) whener possible. Give a simple yet splendid example of why knowing math is important for any programmer who tries to use Py for data science type applications.
That’s why I would use C to program a missile.
Watch out for those null pointers though, wouldn't want your missile program to CRASH!
Usually Ada or SPARK are used in high reliability systems like aircraft instead of c since they are much safer
I had heard about this before, thanks for sharing, it makes a lot of sense in safety critical applications. I know NASA still uses C for (as least some?) space shuttles!
@@bamberghh1691 I knew ADA is used for missiles... But is there a specific reason why.
@@mwanikimwaniki6801 Ada has been specifically designed to catch as many mistakes at compile time. It’s static analysis tools are extremely thorough, and in general, it’s designed to prevent undefined behavior, from its type system, to its syntax. It’s an interesting language.
If I recall correctly there was an optimisation in LLVM/Clang that was able to detect looping over elements in order to add them and inject the n(n-1)/2 directly into the assembler. :)
That is ridiculously clear, it blew me away. Straight to the point, extremely clear explanation, very methodical, and everything is shown perfectly. Just purely amazing tbh.
def recursive_loop(n=100_000_000):
s = 0
def loop(i):
nonlocal s
s += i
i += 1
return loop(i) if i < n else s
return loop(0)
> *RecursionError: maximum recursion depth exceeded in comparison*
import sys
sys.setrecursionlimit(100_000_000)
> *segmentation fault (core dumped)*
😂😂😂
:) nice try A+ for effort 👌
Nice!
Question: if you increment i inside the for loop, won't that affect the amount of iterations (skip every other one)?
It actually does not affect the number of iterations. The iterator of range keeps its own internal counter that python uses to assign to i each iteration.
The range is built before executing the loop contents.
It's also worth taking a second to remember that even in its slowest form, the python code is adding up 100 million numbers in 20 seconds. It's slow, but "slow" is a relative term.
This is a good point that everyone should keep in mind!
Is slower than other languages but still enough because computers are very fast
On any CPU capable of many teraflops, 20 seconds for 100 million numbers means eons. It's like walking and taking a step only every 5 seconds, short distances It's not that bad, but will badly scale once you need to travel farthest.
This is the first time I subscribe to a channel based on seeing the first video I watch from that channel. Great job!
Such an honor! Thank you!
@@mCoding THANK YOU for creating such quality content.
Cool video. I would add that numba and cython are also super fast alternatives.
Your lessons are awesome. They have a Corey Schafer feel to the presentation in their brevity and clarity.
Thank you! I've seen many of Corey's videos, so I'm glad to hear that!
"Is there anything o beat numpy?"
Try compiling numpy top native machine code using numba:
@nb.njit(cache=True, parallel=True)
def numba_sum():
return np.sum(np.arange(100_000_000))
> 0.00064 seconds
Or suing your GPU (Cupy):
def cupy_sum():
return cp.sum(cp.arange(100_000_000))
> 0.001747 seconds
Cupy should be an order of magnitude faster than you CPU, but only if you do some more complex computation than calculating the sum of the array.
Good point! Maybe some day I'll do a comparison of which C library for Python is the fastest... so many choices though.
@@mCoding Would be interesting to watch.
Good video btw. I did not expect the "i = i+1" to have such a huge impact when iterating.
Pretty certain the numpy functions are built using SSE/AVX intrinsics, hence their massive speed up compared to scalar only methods.
for people more interested in the formulae, it's for sum of n terms of an AP
the formula is:
(n/2)[2a+(n-1)d]
where,
a = starting term
d = common difference
n = till nth term
Thanks for the information! It's really helpfull to get a glance into the Python interpreter's inner workings
This seems insane to me that summation took more than a second! What is the record for the same in raw C/C++?
It would definitely be much faster. Especially since this is a special optimization case. It might not even generate a loop.
100_000_000 is a big number! Agreed, C/C++ would likely optimize it away.
@@mCoding I think you should be more clear regarding the difference between language and implementation. It was definitely an interesting video, but this is not a problem inherent to the language itself, but rather a problem of (probably?) CPython. Also, C doesn’t optimize anything, gcc does. At least if you don’t tell it otherwise.
@@toebs_
Oh yes you need to enable the optimization flag -On n for random number .
Python is kinda like the doctor:
It's the last place you want to be, but you go when you need it, because otherwise things might get more complicated
I personally like to start with python. Its vast libraries and simple structure is easy for me to test programs. Later when I decide on a specific implementation, i'll code that in C++.
@@akshaybodla163 Yeah, I can see that, it's nice, but how do you transfer a python library to C++?
@@andreiiaz2097 A lot of C/C++ libraries have Python bindings so you can just use the same ones a lot of the time.
In reality, python is the first thing most people will use because it is easy to use.
Python is only usefull because if works as a glue for other languages. Easy access to libraries written in more difficult languages.
Such a fact. I do statistics research using Stata. My rule is Stata where I can, Python where I must.
Actually any modern C or C++ compiler with optimization turned on (-O3 or similar) would by itself compute the value of the sum compile-time and just inline the result. I am shocked that Python interpreter does not even try to for example throw out unused instruction during compilation. There must be a way to enable this. Or maybe this instructions actually have a side effect because of its weird namespace policy and cannot be thrown away without potentially spoiling the code. I am not sure...
Both C and C++ compilers such as GCC and LLVM are written in C++.
Python is written in C.
JavaScript is main Python's competitor in easy to use scripting language... And it's V8 is written in C++ and its JIT is incredibly fast, the only thing that can compare is LuaJIT which is practically abandonware because one dev who made it has stopped caring and noone else is intelligent enough to pick it up and maintain it properly. Still, the difference is that LuaJIT is very fast... But it can't stream code and execute it in my browser as fast as possible, most projects don't even use lua for one reason or another.
Just shows that C is obsolete language and all good programmers use C++.
There of course will still be C zealots that will pretend otherwise, but they can't even write their own compiler and have to depend on C++ programmers who they hate more than they hate C++ for some reason.
Yes I strongly agree: 1.) optimized algorithm over loops, 2.) optimized (not runtime interpreted) loop-code, 3.) if none of the former is possible slow interpreted loop directly in Python. Where (3) is ok for best flexibility expirience when learning or developing & debugging inside new algorithms.
I've been a staff software engineer (development & quality assurance) delievering a (Basic) scriptable technical measurement data PC-application NI DIAdem, concentrating much effort into giving app-poewer users as script-programmers (engineers & physicists) optimized script commands, which performance expirience behave similar to that of Python. So using already implemented looping command may give you better performance than own loops as described here.
that is why you don't do intensive processing in Python itself, but rather use packages that implement the functionality you desire directly on C
that said, the best way to loop in Python should always be to use the simplest construct that fits your use case, since Python gets more value from readability then from scavenging microseconds of processing
it's also always good to import this
Well said. An indictment of Python.
It's not really a fastest way to do it, it's always faster to use a math formula to compute a thing, if a formula exists.
The purpose of using Python (or any other language) is not to replace math formulas but implement an algorithm to solve problems in an easy way (whatever easy means).
Another way to be faster, is to implement the operations you need in assembly, but doing so you miss the point.
Math is faster than assembly, assembly is faster that C, C is faster than Python and Python is faster than bash. But they are different languages and need different skills.
The correct answer is: Prefer "for" over "while" to loop in Python because it's faster.
4:05 "Is there anything we can do to go faster?"
Yes. Write the code in anything but Python.
Did you run these tests with Cython? Probably bringing everything to C would have interesting results. The async model on the for iterator for example should not be faster than a while loop unless the async code were optimized out.
The point about math is incredibly true: I optimized my VarInt bytesize computation algorithm from iteration to bit operations and lookups, and it now runs (at 2^32 loops) in ~2s vs ~60s
WHOA 45K?! The first time I saw your channel you had 12K subs. C0ngrats and keep it up!
while loop 6.429943797
for pure 4.258419519
sum range 1.4508647020000005
numpy sum 0.2784187069999984
math sum 1.868000000015968e-06
iMac i7-10700K, Python 3.9.2, macOS Big Sur
Thanks for sharing. Other people reading this, let us know what your numbers are too!
while loop 5.056872800000
for pure 3.253743700000
sum range 2.379359199999
numpy sum 0.138200499999
math sum 6.299999999015e-06
Desktop PC with AMD R5 3600, Python 3.9.2
Strange that my math sum is so slower than the Intel iMac while my other operations are faster. If I was to guess, there's something in Windows that adds a little delay before or after the operation runs. Might be that it takes a bit longer to call the function, but short enough that it is drowned out in the longer operations where my CPU runs faster.
@@jensdanbolt6953
while loop 4.5906675
for pure 2.9198796
sum range 1.9685457999999993
numpy sum 0.10591909999999949
math sum 2.799999998970293e-06
the same iMac, i7-10700K, Python 3.9.2, Win10Pro (Boot Camp)
@@atvarsgr While we are at it here is the contrast to a c implementation:
python with for loop: 8.74s
c with gcc and no compiler optimizations: 0.235734s
c with gcc and full compiler optimizations: 0.048770s (gcc doesn't detect the pattern and actually executes the entire code)
c with clang and full compiler optimizations: 0.000001s
Let's also add to this that it's not just the speed that matters. Choose the right abstractions. If your language can only say "while" and "for", that's a lot to write and a lot to comprehend to work out the meaning from low-level statements like "increment" and "compare". The code is much clearer if it literally says "sum a range". The more iteration abstractions you have, the clearer the intent of the code.
Thanks for that! I had no idea about the ratios. Two ideas though:
I'm doing statistics and ETL jobs with pandas. I would've been interested in other methods as well like recursion or doing the same with pandas instead of numpy (whether it's any slower than pure numpy), pandas + apply + lambda, etc.
Second, going one step further and optimising performance for if-elif-else / mapping function would be great for another video.
Didn't knew that Python is SO slow with those loops! Good video!
btw it's doing 100M iterations so it won't affect too much in small iterations
Great stuff! I never considered that while loops would be slower than for loops, but even as you started I was guessing it would!
Glad you liked it!
The conclusion: "maybe you didn't need that loop in the first place". I like it.
Indeed loops are the piece of code that is really worth to think through. Maybe part of computations is the same in each iteration, then calculate it before the loop, then in the loop just use the result.
If you can speed up a critical loop at the expense of code readability, do it, but add comments why and how it's done.
Today's lesson:
*Pay attention in math class*
Great video! Really informative. I'm surprised by how much faster numpy sum is over the built-in sum.
As an aside, the fastest solution would be to use the sum(1...n) = n(n+1)/2 formula. The fastest loop is no loop at all :)
I got these results:
while loop: 3.0520888
for loop: 2.9422460999999998
By removing the unnecessary i variable in the while loop:
def while_loop(n=100_000_000):
s = 0
while s < n:
s += 1
return s
That i has to be type checked by the interpreter in every loop.
You pointed out the range factor yourself, which also contributes immensely to something not essential to the speed of the loops.
I just beat numpy with a standard C++ program using a register ...
while loop took 2.5 seconds to complete, numpy took 123ms approx. My solution took 56ms, compiled with g++, 65ms with gcc.
Without register flag on the int, time jumps up to 66ms and 72ms respectively.
int main(){
register int rbx=0;
for(int rax; rax