Array oriented designs are a the core of the low level model of a trading system. And while this array view is much of what this talk is about, it is important enough to reemphasize it. Also, template based objects are handy to glue your arrays together so as to fully optimize the result.
At 23:00, when stable_vector is explained (built using boost::container::static_vector), just mentioning additional info for viewers. boost::container::deque has a feature to allow configuring the chunk size (called block size in Boost).
I think there is a flaw with this design. Since the spmc queue supports variable-length messages, if a consumer is lapped by the producer, the mVersion field the consumer thinks it is spinning on is probably not the version counter field at all. It may well be spinning on some random bytes right in the middle of mData. Then if the random bytes happen to be the version of the consumer is expecting(although the probability is very low), it could be disastrous. The customer does not know it is lapped at all, and continues processing with the meaningless data.
in the slide SPMC Queue V1, there is a line assuming the mCurrent buffer format is std::memcpy(&size, mCurrent+sizeof(MessageSize), sizeof(MessageSize)); it sounds like a coy paste error, what actually you mean is std::memcpy(&size, mCurrent, sizeof(MessageSize));
Note that the write increments a counter by one, copies, then increments by one again. So if the consumer reads in the middle of writing, the counter is odd (or the lowest bit is 1). Only when it is done writing is it even again
@@Alex-zq1yy What happens in the scenario where we have 2 writers: Writer A increments a counter by one, and is now writing. Then, while the writing is in progress, Writer B increments value by one as well (to then start writing). Now, before Writer A increments the counter again, consumer reads and counter is even, despite none of the writes being completed. Wouldn't that be possible to happen? Perhaps the full implementation also checks preemptively if the lowest bit is 1. Then this problem wouldn't exist.
It just adds one, if the number is odd then a write is in progress. What is confusing however is the assembly shown, since it would not work with the add 2. It would fail on the load, but then the if statement he shows is not needed. And when the load fails the data would be inconsistent, which I guess is by design?
I am pretty certain that SeqLock doesn't work with multiple producers - if one than one producer is in Store() the memcpy can be overlapped - leading to a corrupted value. E.g. Imagine T is a 1mb array. Producer 1 copies 512k, gets descheduled, Producer 2 copies 1mb, Producer 1 is rescheduled and copies the remaining 512k. Now T contains half of the value from producer 1 and half of the array from producer 2. In order for this to work, the memcpy would need to be atomic/synchronized - and it still would not provide an ordering consistent with the mVersion. Not to mention that if two producers increment the mVersion, the reader will see it as stable/valid even though neither memcpy has completed even if ordered. The provided code is only valid for a single producer.
Of course, the provided code is simplified. The key observation is that the lock is considered to be held by a producer if the version number is odd, and it is free if the version number is even. To support multiple producers, you can interpret the code between the two version updates to be the critical section. The first producer to succeed in incrementing the version number from even to odd atomically acquires the lock. Other producers won't be able to enter the critical section because the version number they will see is odd, and they will have to wait until the first producer finishes updating the buffer and releases the lock
Because a copy assignment might have control flow and branches. Imagine this, while the copy assignment is executing in the reader, a 'write' operation is taking place on another thread. At first glance that might seem OK since the value will be discarded when the version check fails in the reader. However, it is dangerous because it might result in unpredictable state in the logic. For example: if (member_ptr != nullptr) { use_member(*member_ptr); } You can see how the check can pass and before the body of the if-statement executes, the writer would assign nullptr to member_ptr and boom you crash. So, the solution is to either do memcpy and hope it works at all, if not it will crash spectacularly most of the time, which should be a good indication you're doing something wrong. Or a better solution is to constrain the template parameter to be trivially_copyable
@@shakoooskin all these cases we are talking about struct of data, in the C sense, POD data. Default copy constructor or memcpy are the same, they both will be inlined anyway.
It says on the diagram that they have a trigger price at the FPGA... so I'm assuming they have something ready to send back to the exchange as soon as they receive a message as long as the incoming message fits certain criteria. So, most of the 10 nanoseconds is probably just physical time it takes for the message to get to the FPGA, compare bits, and send something back.
I imagine it would be tricky to write the instrument container in (safe) Rust since it must hold a bunch of stable references. The concurrent data structure would probably be challenging as well since the same borrowing rules prevent the kind of "optimistic" lock-free operation (though keep note that, as written, the SeqLock & friends code is UB in C++).
@@isodoubletif ur working on similar project i would love to have a talk.. I want to start building these tools asap and sell to investors. I also got some clients ready.
For the SPMC Queue V2 at 45:00, why does he have an mVersion at all? If the block isn't valid until mBlockCounter has been incremented, then readers don't risk reading during a write, no? Or, if you are reading while it's writing, it's because you've lagged so hard that the writer is lapping you.
I don't see the reader code. My guess is that each reader has it's own head. If at the end of a read read_version != current_version, it means the reader lagged behind for at least a loop. How does the reader recovers from the situation? Discard all state and rejoin from the head? I have no idea.
@@BenGalehouse Just allocate an array large enough . If you know you have 100 instruments and 100 new created intraday on average ... just use normal vector preallocated for a size of 1k... that way you are sure you dont invalidate anything.
Record args in binary form, record format string only once. Use thread local buffers to avoid contention. NEVER rely on delegating work to another thread except for handing off full instrumentation buffers. View logs offline by reconstituting the data back into readable format. I've been using a system like this for 10-15 years. Logging overhead, if done wisely, can easily reach single digit nanoseconds per entry. Even lower if you consider concurrency of logging many threads simultaneously.
He mentioned this but all log events are produced to a shared memory queue, which is then consumed by a consumer that then publishes it to, say, TimeseriesDB. Using the SeqLock idea, publisher isn't blockable by consumer, and the consumers are isolated from each other.
@@Michael_19056 Hi , why is using another thread for logging bad? let's say theoretically that we could garuntee that the logging thread will never thrash the same cache as the main fn, would it still interfere? & if the added instructions required to save that data "in the same breath" are so light that it only impacts on the nanosecond scale, does it become complicated to implement?
@@_RMSG_ sorry, I only saw your reply just now. In my experience, it would take longer to delegate the data to another thread than to simply record the data with the current thread. Again, the most efficient approach is to use a thread_local buffer to copy the arguments into so there is not locking or synchronization required for the thread to log its own args.
I was wondering about that too. If the wire between the exchange server and the clients' machines is more than 1.5m long then it's not even physically possible. He has to mean the wire-to-wire latency
I work at a similar trading firm to Optiver and when we measure trigger to trade the trigger is the time at which we see the exchange's packet on our switch. I think it's standard in the business to refer to it like that as no client of the exchange can see the packet before it reaches the client's switch.
My question is std::unordered_map is still not very efficient because the pointer itself still lives in heap and you are getting one indirection at least because they are stored as pointer to a pointer in the bucket. Am I mistaken somehow?
I was wondering the same thing. I have three theories. One, most instruments are added to the store at construction time (or in one large chuck) and the memory for the pointers are luckily allocated sequentially/contiguously which is easier due to the size of the the pointer being significantly smaller than the Instrument struct. And two, they know how the allocator they're using works or have implemented their own (they do say they don't include all the details), and know it will more likely allocate into contiguous addresses being made easier by the smaller size of the pointer vs the Instrument struct. Thirdly, they could reserve space for the map at construction time (again, they say they don't include all the details). Imo, reserving space for this seems pretty straightforward and I would imagine they could be doing something like this. Would be easier to tell if we knew how dynamic the number of instruments is... but... I imagine for a given application it is relatively consistent and is something that would be configurable or deducible. Good chance that I'm missing something too, but these are just my thoughts.
Yes, you are right in that it is a couple jumps, but this is missing the bigger picture about what this design choice accomplishes. Better locality. You want the data in your program to be close together. Everything on your computer wants the data to be close together. Your hardware, if it sees you make consecutive memory accesses, WANTS to preload a big chunk of memory. Your page table address converter wants you to be playing in the same few pages so you don't have to do an expensive page table walk. Your L2/L3 cache don't want to have to constantly be cleaning themselves out. And so part of the game is the tiny optimizations - the instruction level battle (such as avoiding the indirection that you mention). But individual instructions are so fast anyways - all your latency in a single threaded program like this is really coming from TLB lookups and calls to RAM.
What is this "auto _" at ua-cam.com/video/8uAW5FQtcvE/v-deo.html ? Is this underscore just a way to say "not needed variable" or there is something new in C++ syntax?
@@MeetingCPP thanks for the clarification. I remember that some years ago even use of '_' prefix in variable name in C/C++ was reserved for language system needs, now even the '_' alone is used. Funny usage, although.
@@doctorshadow2482 Well, its not a C++ invention, I've seen this used as a popular place holder variable (because its needs a name) in code snippets of other programming languages.
He was very knowledgeable but his presentation was not very good. He should have slowed down his thought process for people like me who are not familiar with the subject matter so that we can follow him. But should thank you anyway for the things I picked up from his talk; like the stable vector data structure.
If this man is writing a book; something like "introduction high performance trading"; then I am buying it !
Do you know any material or anyone who publishes regarding the subject?
@@payamism Quant Galore
this guy is totally clueless and your are even more clueless..
@@boohoo5419 how so?
@@boohoo5419 "your are"
Interesting and always love speakers who give "further reading".
Same thought
Array oriented designs are a the core of the low level model of a trading system. And while this array view is much of what this talk is about, it is important enough to reemphasize it. Also, template based objects are handy to glue your arrays together so as to fully optimize the result.
Is it what similar to data oriented programming in game?
@@sui-chan.wa.kyou.mo.chiisai Yes.... DOP rules over all... OOP to the trash 😋
@@santmat007FINALLY CANCER OOP IS DEAD GO GO FOP AND DOP 🦀
Can you recommend any book(or books) specifically for this stuff?
At 23:00, when stable_vector is explained (built using boost::container::static_vector), just mentioning additional info for viewers. boost::container::deque has a feature to allow configuring the chunk size (called block size in Boost).
Amazing talk, Learnt a lot , shoutout to you!
I think there is a flaw with this design. Since the spmc queue supports variable-length messages, if a consumer is lapped by the producer, the mVersion field the consumer thinks it is spinning on is probably not the version counter field at all. It may well be spinning on some random bytes right in the middle of mData. Then if the random bytes happen to be the version of the consumer is expecting(although the probability is very low), it could be disastrous. The customer does not know it is lapped at all, and continues processing with the meaningless data.
in the slide SPMC Queue V1, there is a line assuming the mCurrent buffer format is
std::memcpy(&size, mCurrent+sizeof(MessageSize), sizeof(MessageSize));
it sounds like a coy paste error, what actually you mean is
std::memcpy(&size, mCurrent, sizeof(MessageSize));
Where is the link to the code for Seqlock and SPMC shared in the talk ?
Excellent talk, I learned a lot!
can anyone explain how the write is setting the lowest bit to 1, is this a design feature of std::atomic? 34:23
Note that the write increments a counter by one, copies, then increments by one again. So if the consumer reads in the middle of writing, the counter is odd (or the lowest bit is 1). Only when it is done writing is it even again
Remember his logic is that if the mVersion is odd, then it's currently being written. (int & 1)==0 is just an ugly version of an "is even" function.
@@Alex-zq1yy What happens in the scenario where we have 2 writers: Writer A increments a counter by one, and is now writing. Then, while the writing is in progress, Writer B increments value by one as well (to then start writing). Now, before Writer A increments the counter again, consumer reads and counter is even, despite none of the writes being completed.
Wouldn't that be possible to happen? Perhaps the full implementation also checks preemptively if the lowest bit is 1. Then this problem wouldn't exist.
@@gabrielsegatti8017 The code comment says one producer multiple consumers, so there can't be two or more writers.
It just adds one, if the number is odd then a write is in progress. What is confusing however is the assembly shown, since it would not work with the add 2.
It would fail on the load, but then the if statement he shows is not needed. And when the load fails the data would be inconsistent, which I guess is by design?
I am pretty certain that SeqLock doesn't work with multiple producers - if one than one producer is in Store() the memcpy can be overlapped - leading to a corrupted value.
E.g. Imagine T is a 1mb array. Producer 1 copies 512k, gets descheduled, Producer 2 copies 1mb, Producer 1 is rescheduled and copies the remaining 512k. Now T contains half of the value from producer 1 and half of the array from producer 2. In order for this to work, the memcpy would need to be atomic/synchronized - and it still would not provide an ordering consistent with the mVersion.
Not to mention that if two producers increment the mVersion, the reader will see it as stable/valid even though neither memcpy has completed even if ordered.
The provided code is only valid for a single producer.
Of course, the provided code is simplified. The key observation is that the lock is considered to be held by a producer if the version number is odd, and it is free if the version number is even. To support multiple producers, you can interpret the code between the two version updates to be the critical section. The first producer to succeed in incrementing the version number from even to odd atomically acquires the lock. Other producers won't be able to enter the critical section because the version number they will see is odd, and they will have to wait until the first producer finishes updating the buffer and releases the lock
At 33:00, why does it use memcpy instead of a copy assignment?
copying large blocks of memory or large nested structs is more efficient using memcopy.
@@RayZde Can't someone overload assignments for structs such as those to ensure the use of memcopy?
Because a copy assignment might have control flow and branches.
Imagine this, while the copy assignment is executing in the reader, a 'write' operation is taking place on another thread. At first glance that might seem OK since the value will be discarded when the version check fails in the reader. However, it is dangerous because it might result in unpredictable state in the logic.
For example:
if (member_ptr != nullptr) { use_member(*member_ptr); }
You can see how the check can pass and before the body of the if-statement executes, the writer would assign nullptr to member_ptr and boom you crash.
So, the solution is to either do memcpy and hope it works at all, if not it will crash spectacularly most of the time, which should be a good indication you're doing something wrong. Or a better solution is to constrain the template parameter to be trivially_copyable
@@RayZde no this has nothing to do with efficiency. It's about correctness. check my reply to the OP.
@@shakoooskin all these cases we are talking about struct of data, in the C sense, POD data. Default copy constructor or memcpy are the same, they both will be inlined anyway.
Nice! Some examples of discussion of queues for few producers and many consumers.
I really didn't understand the 10 nanosecond latency.
Anyone here could help?
It says on the diagram that they have a trigger price at the FPGA... so I'm assuming they have something ready to send back to the exchange as soon as they receive a message as long as the incoming message fits certain criteria. So, most of the 10 nanoseconds is probably just physical time it takes for the message to get to the FPGA, compare bits, and send something back.
either that or a commenter below is correct and the 10ns just represents the time at the fpga
how does rust compare to this?
trash
I imagine it would be tricky to write the instrument container in (safe) Rust since it must hold a bunch of stable references. The concurrent data structure would probably be challenging as well since the same borrowing rules prevent the kind of "optimistic" lock-free operation (though keep note that, as written, the SeqLock & friends code is UB in C++).
@@isodoubletit's safer tho for data and it's also good performance like C++ I just don't know if it performs just as good as C++?
@@isodoubletif ur working on similar project i would love to have a talk.. I want to start building these tools asap and sell to investors. I also got some clients ready.
For the SPMC Queue V2 at 45:00, why does he have an mVersion at all? If the block isn't valid until mBlockCounter has been incremented, then readers don't risk reading during a write, no? Or, if you are reading while it's writing, it's because you've lagged so hard that the writer is lapping you.
I don't see the reader code. My guess is that each reader has it's own head. If at the end of a read read_version != current_version, it means the reader lagged behind for at least a loop. How does the reader recovers from the situation? Discard all state and rejoin from the head? I have no idea.
Is there a link with the code? Or it is so proprietary that it cannot be shared ?
What advantage does stable_vector provide that std::array does not?
The ability to add additional elements. (without starting over and invalidating existing references)
@@BenGalehouse Just allocate an array large enough . If you know you have 100 instruments and 100 new created intraday on average ... just use normal vector preallocated for a size of 1k... that way you are sure you dont invalidate anything.
The stable vector shown here has constant lookup time if im not mistaken so thats a big advantage
"its hard to crack the S&P 500"
Explain that to congress.
im out of the loop, has congress been trying to crack the s&p500? would appreciate an explanation why this comment gets likes
@@DylanSmith-vj7qo He probably means they trade using insider information and thus beat it easily.
std::memcpy is not data-race safe as per the standard. you could use std::atomic_ref to read/wrie individual bytes of the object.
Very nice, I'm curious how do you log in production without sacrificing performance ?
only memcpy raw args in the main thread and let the logging thread format to the string and create the logs
Record args in binary form, record format string only once. Use thread local buffers to avoid contention. NEVER rely on delegating work to another thread except for handing off full instrumentation buffers. View logs offline by reconstituting the data back into readable format.
I've been using a system like this for 10-15 years. Logging overhead, if done wisely, can easily reach single digit nanoseconds per entry. Even lower if you consider concurrency of logging many threads simultaneously.
He mentioned this but all log events are produced to a shared memory queue, which is then consumed by a consumer that then publishes it to, say, TimeseriesDB. Using the SeqLock idea, publisher isn't blockable by consumer, and the consumers are isolated from each other.
@@Michael_19056 Hi , why is using another thread for logging bad? let's say theoretically that we could garuntee that the logging thread will never thrash the same cache as the main fn, would it still interfere? & if the added instructions required to save that data "in the same breath" are so light that it only impacts on the nanosecond scale, does it become complicated to implement?
@@_RMSG_ sorry, I only saw your reply just now. In my experience, it would take longer to delegate the data to another thread than to simply record the data with the current thread. Again, the most efficient approach is to use a thread_local buffer to copy the arguments into so there is not locking or synchronization required for the thread to log its own args.
Which exchange can give you trigger to trade at 10ns? You probably not mean the exchange timestamp but more your capture timestamp on your wire.
Haha, I see you are here as well.
I was wondering about that too. If the wire between the exchange server and the clients' machines is more than 1.5m long then it's not even physically possible. He has to mean the wire-to-wire latency
I work at a similar trading firm to Optiver and when we measure trigger to trade the trigger is the time at which we see the exchange's packet on our switch. I think it's standard in the business to refer to it like that as no client of the exchange can see the packet before it reaches the client's switch.
The secret to low network latency is to be co-located in the exchange's data center. Even ten years ago it was worth it.
@@davejensen5443 Occam's razor
Optiver is market maker so requirements are a bit different but generally speaking trading at these time scales is just noise
My question is std::unordered_map is still not very efficient because the pointer itself still lives in heap and you are getting one indirection at least because they are stored as pointer to a pointer in the bucket. Am I mistaken somehow?
I was wondering the same thing. I have three theories. One, most instruments are added to the store at construction time (or in one large chuck) and the memory for the pointers are luckily allocated sequentially/contiguously which is easier due to the size of the the pointer being significantly smaller than the Instrument struct. And two, they know how the allocator they're using works or have implemented their own (they do say they don't include all the details), and know it will more likely allocate into contiguous addresses being made easier by the smaller size of the pointer vs the Instrument struct. Thirdly, they could reserve space for the map at construction time (again, they say they don't include all the details).
Imo, reserving space for this seems pretty straightforward and I would imagine they could be doing something like this. Would be easier to tell if we knew how dynamic the number of instruments is... but... I imagine for a given application it is relatively consistent and is something that would be configurable or deducible.
Good chance that I'm missing something too, but these are just my thoughts.
Yes, you are right in that it is a couple jumps, but this is missing the bigger picture about what this design choice accomplishes.
Better locality. You want the data in your program to be close together. Everything on your computer wants the data to be close together. Your hardware, if it sees you make consecutive memory accesses, WANTS to preload a big chunk of memory. Your page table address converter wants you to be playing in the same few pages so you don't have to do an expensive page table walk. Your L2/L3 cache don't want to have to constantly be cleaning themselves out.
And so part of the game is the tiny optimizations - the instruction level battle (such as avoiding the indirection that you mention). But individual instructions are so fast anyways - all your latency in a single threaded program like this is really coming from TLB lookups and calls to RAM.
Wow ! So great. You help me a lots.
Love you sir from India Bihar ❤❤❤
What is this "auto _" at ua-cam.com/video/8uAW5FQtcvE/v-deo.html ? Is this underscore just a way to say "not needed variable" or there is something new in C++ syntax?
_ is the variable name, its in this case '_'. Likely because its not even used.
@@MeetingCPP thanks for the clarification. I remember that some years ago even use of '_' prefix in variable name in C/C++ was reserved for language system needs, now even the '_' alone is used. Funny usage, although.
@@doctorshadow2482 Well, its not a C++ invention, I've seen this used as a popular place holder variable (because its needs a name) in code snippets of other programming languages.
The slides don't exist at the URL.
Seems like the speaker didn't share them. :/
Undamped systems ALWAYS devolve to chaos...
I want to know your tick-to-order and jitter.
Optiver's competitors beware!
❤😂🎉😅看着发量,就是高手👍
who can print momey the fastest, same thing, 😉
He was very knowledgeable but his presentation was not very good. He should have slowed down his thought process for people like me who are not familiar with the subject matter so that we can follow him. But should thank you anyway for the things I picked up from his talk; like the stable vector data structure.
Not every talk is aimed at beginners