Join My AI Career Program
www.nicolai-nielsen.com/aicareer
Enroll in My School and Technical Courses
www.nicos-school.com
Great video! I used your package and modified it a bit to my liking.
One correction, though: transf = vo.get_pose(q1, q2) can return infinite values at [0, 3] and [1, 3], especially when the video shows something coming to a stop.
Adding transf = np.nan_to_num(transf, neginf=0, posinf=0) fixes the issue.
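For reference, a minimal sketch of that guard, assuming transf is the 4x4 homogeneous transform returned by get_pose as in the video's code:

import numpy as np

def sanitize_transform(transf):
    # Replace NaN/inf entries (they can show up when the camera barely moves)
    # with zeros so the chained trajectory does not blow up.
    return np.nan_to_num(transf, nan=0.0, posinf=0.0, neginf=0.0)

# Dummy example with a translation that blew up to infinity:
bad = np.eye(4)
bad[0, 3] = np.inf
bad[1, 3] = -np.inf
print(sanitize_transform(bad))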
Thank you for this amazing material! Can't wait to see the next steps with bundle adjustment 🙂
Thanks for watching!
This is amazing, I will definitely use this in my final course work.
Thanks a lot!
Nice overview of a VO process. Well done. However, I think using a unit translation to get your second set of homogeneous coordinates, and then using those points to ultimately calculate scale, will always give you something close to 1. This issue is hidden by the fact that the ground truth in your dataset changes by approximately one unit per frame. You can check this by skipping several frames: the displacement should be greater, but your algorithm will still give a scale of about one.
@34:02
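A rough sketch of the check suggested above, assuming the ground truth is a KITTI-style poses.txt with one flattened 3x4 pose per line (the file path is an assumption): the true displacement grows with the number of skipped frames, while a unit-translation-based relative scale would stay near 1.

import numpy as np

# Load KITTI-style ground-truth poses (12 values per line -> 3x4 matrices).
poses = np.loadtxt("poses.txt").reshape(-1, 3, 4)

i = 0
for k in (1, 2, 5, 10):
    # Norm of the change in the translation column between frame i and frame i+k.
    disp = np.linalg.norm(poses[i + k][:, 3] - poses[i][:, 3])
    print(f"ground-truth displacement over {k} frame(s): {disp:.2f}")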
Thanks! Found your channel today and it's an absolute gem for beginner computer vision researchers.
Thanks a lot for the nice words!
Really helpful vid! Might have some questions later on though. Thanks man and keep up the amazing work!
Thank you so much! Really appreciate it and feel free to ask whatever questions u have
Amazing tutorial, bro!! Keep it going.
I have started my master's thesis project in VIO so this is quite interesting. I will use sensor fusion though, and not just VO.
Thanks man, this is awesome. Keep up the good work ;)
Thanks a lot man!
Very cool!
Thank You for this video.
Thanks for watching! Hope that it can help u
Hey! Awesome video. Just a tiny question: I was not able to find the image_r folder in the KITTI dataset. Could anyone help with that?
Thanks for this excellent tutorial. I have a question. Basically, I took a look at your approach to see if I could get my pose estimation relative scale working correctly. It didn't. The relative scale computed is, most of the time, a number close to 1. Any recommendations on this?
So, it makes sense to me how you calculate the relative scale, but not how you apply it. Normally, I would say that once you have calculated the relative scale r, you scale the translation vector t between the two camera frames as r*t in order to get the correct scale between the frames. But here you use it only for the chirality check instead? What is the reason for adding the relative scale to the number of points that have positive depth? I fail to see the reason behind using the relative scale like this.
So what is the difference between optical flow and visual odometry? Which one is better to use for real-time "location" estimation and navigation, for example for drones?
There are some similarities. Optical flow can also be used to track features from frame to frame. What to use depends on your feature extractor and system and so on. I'm gonna make a stereo visual odometry video too where I use optical flow to track feature points
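As an illustration of that tracking variant (a sketch, not the code from this video): detect corners once and follow them into the next frame with pyramidal Lucas-Kanade optical flow instead of matching descriptors.

import cv2

def track_features(prev_gray, cur_gray):
    # Detect corners in the previous grayscale frame.
    p0 = cv2.goodFeaturesToTrack(prev_gray, maxCorners=500, qualityLevel=0.01, minDistance=7)
    # Track them into the current frame with pyramidal Lucas-Kanade optical flow.
    p1, status, _err = cv2.calcOpticalFlowPyrLK(prev_gray, cur_gray, p0, None)
    good = status.ravel() == 1
    # q1/q2-style point sets that could feed the essential-matrix estimation.
    return p0[good].reshape(-1, 2), p1[good].reshape(-1, 2)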
Great work! Would love to see more videos on the topic of computer vision where the camera is moving.
Thanks for watching! Will definitely do more of those
THANK YOU!!!
Thanks for watching!
You can recover scale by, e.g., introducing the assumption that the points you use are from the street in front of you. I did that to stabilize a self-balancing robot with VO (there is a video on my account).
How did you implement this assumption? I'm trying to recover the scale and got stuck with this problem.
@mertipolati With monocular odometry you need to know something, e.g. the distance from your camera to the surface. There is no other way. So basically like what is done here: people.inf.ethz.ch/pomarc/pubs/SaurerVICOMOR12.pdf, but where you do know d in Eq. (1).
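One common way to turn that assumption into code (a minimal sketch, not the full homography method from the linked paper): if the triangulated points are assumed to lie on the road and the camera's mounting height is known, the ratio of the two gives the missing scale. The roughly 1.65 m used below is approximately the KITTI mounting height.

import numpy as np

def scale_from_known_camera_height(ground_points_cam, camera_height_m):
    # In the camera frame (y pointing down), the y coordinate of a road point is
    # the unscaled camera-to-ground distance; divide the known metric height by
    # its median to get the scale factor for the estimated translation.
    unscaled_height = np.median(ground_points_cam[:, 1])
    return camera_height_m / unscaled_height

# Hypothetical usage with made-up triangulated road points:
ground_points_cam = np.array([[0.5, 1.1, 4.0], [-0.3, 0.9, 6.0], [0.1, 1.0, 8.0]])
print(scale_from_known_camera_height(ground_points_cam, camera_height_m=1.65))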
Hey Nicolai, can you please share any resource from which I can learn to integrate bundle adjustment into this code (basically to get vSLAM working)? Thanks for the tutorial.
In decompose_essential_mat, is the technique you used to find the correct [R, t] pair when decomposing the essential matrix a heuristic you implemented from scratch, or is there a published paper explaining the method?
I'm interested in learning about the possibility of applying visual odometry as an initial step to camera matchmoving. Any thoughts?
Why are we taking the 0th and 2nd indices of the translation vector in "gt_path.append((gt_pose[0, 3], gt_pose[2, 3]))"?
Isn't that (tx, tz), whereas we need (tx, ty)?
Or is it because, in the 2D world with respect to the camera, the camera's z is the y of the real 2D world?
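For what it's worth, the last guess is right: in the KITTI camera frame x points right, y points down and z points forward, so a top-down plot of the driven path uses (tx, tz), and ty is just the nearly constant camera height. A tiny illustration with made-up numbers:

import numpy as np

gt_pose = np.eye(4)
gt_pose[:3, 3] = [2.0, 1.65, 30.0]          # tx, ty (height), tz

gt_path = [(gt_pose[0, 3], gt_pose[2, 3])]  # (tx, tz) for the 2D bird's-eye plot
print(gt_path)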
Thank you. Do you have any idea how to implement this with another feature detector? I tried with SIFT and it didn't work so well.
Thanks for watching! Yeah u can use all the feature detectors from opencv
Thank you so much for the great video! Just one question:
If we ended up not using the KITTI dataset, how would we go about creating the first projection matrix used at 32:08? (self.P)
Camera calibration. Thanks a lot for watching!
Hi, so we can convert our camera calibration values to a 3-by-4 matrix to get self.P, right? @@NicolaiAI
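For anyone else wondering, a minimal sketch of how such a 3x4 matrix can be built from your own calibration, assuming K is the 3x3 intrinsic matrix from e.g. cv2.calibrateCamera (the numbers below are just KITTI-like examples, not a real calibration):

import numpy as np

K = np.array([[718.856,   0.0,   607.193],
              [  0.0,   718.856, 185.216],
              [  0.0,     0.0,     1.0  ]])

# For the reference camera the projection matrix is K [I | 0]; this would play
# the role of self.P in the video's code.
P = K @ np.hstack((np.eye(3), np.zeros((3, 1))))
print(P)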
Amazing video, really great job. Will you implement SLAM as well?
Instead of decomposing the essential matrix and then finding the correct pose by triangulation of points, can I just use the cv2.recoverPose() method? It does the same thing by itself.
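A sketch of that shortcut (an alternative, not what the video's code does), assuming q1 and q2 are Nx2 arrays of matched pixel coordinates and K is the 3x3 intrinsic matrix:

import cv2

def pose_from_matches(q1, q2, K):
    # Estimate the essential matrix with RANSAC, then let recoverPose do the
    # decomposition and the cheirality check internally.
    E, mask = cv2.findEssentialMat(q1, q2, K, method=cv2.RANSAC, prob=0.999, threshold=1.0)
    _, R, t, _ = cv2.recoverPose(E, q1, q2, K, mask=mask)
    return R, t  # t has unit length; the scale still has to come from elsewhere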
Wow, great!!! Can I use a Raspberry Pi cam for that?
Great video, have you done the video on optimizations as well?
Great video. However, why is your algorithm not calculating the Z direction of the pose??
It is, but only x and y are visualized
Oh great.
Thank you for your video!
How can we create VO for 360 cameras like insta360 x3 if at all possible? Also, is calibrating such a camera possible (equirectangular images)?
Nice video, I'm trying to do it on my own data. Aren't the extrinsic parameters different for each image? So how is it possible that you can use them for your whole image sequence?
I have it running in another video with live cameras
Hi Nicolai! I've been following your channel for a long time and have learned quite a lot from you since I started following you... I am texting you because the project I am stuck on this time is by far the hardest one I've ever come across. It's a freelance project I got from somewhere, and it seems like I have exhausted all my options on how to actually get it done. I need your help with the project, or at the very least suggestions on how I can approach the problem statement or solve it. So, I'll give a brief summary of the project:
I have to come up with a system that can map football players from the video frame to a 2D field image and get their velocities, accelerations, etc. I have used YOLOv7 for detection of the players in the video frame, and I am using the Euclidean distance to keep track of the centroid of the selected player. Now, I want to be able to design a system to map this player onto a 2D field image and get the player's acceleration and velocity. I tried a perspective transform, but it does not seem feasible, as I would have to click on four separate corners every frame if I want to map. I want this process to be automated. Is there any way you can help me? Note: throughout the video the camera angle will not stay constant; it will keep changing. It's a PTZ camera. Please help me with the above.
Thank you.
Here's a step-by-step approach to achieve this:
Camera Calibration:
Perform camera calibration to obtain the camera's intrinsic matrix and distortion coefficients. You can use a chessboard pattern and OpenCV's cv2.calibrateCamera function for this.
For a PTZ camera with a variable field of view, you may need to calibrate the camera multiple times as the camera angle changes.
Object Detection and Tracking:
Use YOLOv7 or any other object detection algorithm to detect football players in video frames. Extract their bounding boxes.
Implement an object tracker (e.g., Kalman filter, CentroidTracker) to track players between frames based on their bounding boxes.
Perspective Transform (Bird's-eye view):
Obtain the 2D field image that you want to map the players onto.
Define four points on the field image corresponding to the four corners of the field.
Implement an automatic method (e.g., feature matching) to estimate the perspective transformation between the field image and the camera view in each frame.
Optical Flow:
Use optical flow algorithms (e.g., Lucas-Kanade, Farneback) to estimate the motion vectors of players between consecutive frames.
Based on the motion vectors and the camera's frame rate, calculate the players' velocities and accelerations in the 2D field coordinate system.
Combine Data:
Using the perspective transform, map the players' positions from the camera view to the 2D field image.
Combine the positional data with the calculated velocities and accelerations to obtain the desired player tracking information.
Might help !!!
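A rough sketch of the perspective-transform step from the list above, assuming you already have point correspondences between the camera frame and the 2D field template (from feature matching or detected pitch-line intersections, both assumptions) and the players' foot points in pixels:

import cv2
import numpy as np

def map_players_to_field(frame_pts, field_pts, player_feet_px):
    # Estimate the frame-to-field homography from the correspondences.
    H, _ = cv2.findHomography(np.float32(frame_pts), np.float32(field_pts), cv2.RANSAC, 5.0)
    # Push each player's foot point through the homography.
    pts = np.float32(player_feet_px).reshape(-1, 1, 2)
    return cv2.perspectiveTransform(pts, H).reshape(-1, 2)  # positions in field coordinates

Velocity then follows from consecutive field positions and the frame rate, v = (p_t - p_{t-1}) * fps.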
Hey, can you please share the article or paper that the theory is taken from?
Have not used a specific article or paper
@@NicolaiAI I was intrigued by the scale calculation using triangulation and then estimating R and t, without using the initial R and t from the built-in cv2.recoverPose.
Great tutorials! May I ask what the full pipeline at 8:43 is? The part after local optimization is occluded by your handsome face 😆
What's the use of getting a pose without scale?
Actually we do: we take the relative scale into account. I go over that in the code
Great tutorial, seriously it's great, thanks for putting this out!!!! I have one question.
At i == 0 of the pose estimation we are using
cur_pose = gt_pose
and after that, at i == 1, we are using
cur_pose = np.matmul(cur_pose, np.linalg.inv(transf))
So in the second iteration we are using the pose we have from ground truth and multiplying it by the pose we have calculated.
What if we don't have ground truth? How will we calculate cur_pose for i = 1 then?
Thanks in advance
Never mind, I got the answer in your live camera trajectory video. Thanks a bunch
@@poproduction3994 Can u tell the solution?
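A minimal sketch of the usual fix when no ground truth is available (an assumption here, not confirmed from the other video): initialise the trajectory at the identity so the first camera frame becomes the origin, and chain the frame-to-frame transforms exactly as before.

import numpy as np

def accumulate_poses(transforms):
    # transforms: list of 4x4 frame-to-frame matrices, e.g. from get_pose(q1, q2).
    cur_pose = np.eye(4)
    path = []
    for transf in transforms:
        cur_pose = cur_pose @ np.linalg.inv(transf)
        path.append((cur_pose[0, 3], cur_pose[2, 3]))
    return path

# Dummy example: three identical steps of one unit.
step = np.eye(4)
step[2, 3] = -1.0
print(accumulate_poses([step, step, step]))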
Monocular camera odometry suffers from scale drift, right? The pose (R, t) doesn't have any units here, right?
It does. We take the relative scale into account; I go over that in the code. But it will be another accumulating error for the odometry
Can you make a video on visual SLAM??
Thanks
Thanks for watching! Hope that u can use it
Great stuff! Very good to start with the theory part to actually understand what's happening in the code.
Question: could the cv2 function cv2.recoverPose(E, q1, q2, K) be used to get the R and t matrices directly from the essential matrix E and K?
Thanks!
How did you create the calibration and poses .txt files? Is there any code for that? Please share if there is.
That's from the KITTI dataset
Thank you for these videos man. I really appreciate it.
I forked the repo to replicate results. Where can we get the "lib" module?
It's a folder next to the Python script, in his GitHub
Hello,
I tried to feed live frames into the code. However, it yielded a very high constant bias and a lot of noise. Is there any way to reduce the constant bias and noise?
Thanks!
U can use different filters. Try out a low-pass filter to start with
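A minimal sketch of such a filter, here simple first-order exponential smoothing applied to the estimated (x, z) positions; alpha is an assumption to tune (closer to 1 trusts new measurements more, closer to 0 smooths more):

import numpy as np

def low_pass(path, alpha=0.3):
    # Blend each new position with the previous smoothed one.
    smoothed = [np.asarray(path[0], dtype=float)]
    for p in path[1:]:
        smoothed.append(alpha * np.asarray(p, dtype=float) + (1 - alpha) * smoothed[-1])
    return smoothed

print(low_pass([(0, 0), (1.2, 0.1), (1.9, -0.2), (3.4, 0.3)]))

Note that a low-pass filter only tames the noise; a constant bias would need calibration or sensor fusion to remove.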
Thanks! But I do have a question. How do we obtain our own pose data without referring to the KITTI dataset?
Thanks for watching! If u want to find the poses of ur own data u can just replace the images with ur own. But then u won't have the ground truth poses
Is the ground truth pose mandatory? Does the code work without the ground truth text file? I read about it the other day and it seemed to be important for obtaining scale information.
@@TheWeibing it's not mandatory but then u kinda don't know how ur system performs
@@NicolaiAI Hey, just wondering, what does the 2nd row, 4th column term in the output transformation matrix represent? [x, ?, y]
What is the name of the GitHub repo?
I have used this code with my own 1350 input images.
The only problem I'm facing is that the model is not able to run the images sequentially.
What I've noticed is that it first runs a few images sequentially (say 30-50) and then it goes back to the start (0-10).
I don't know what to do.
Please help.
can i run this on ROS for a NAO robot?
Where can I get the link for his discord server?
can it also work with object tracking?
Bro didn't use PnP?
Nope
@@NicolaiAI ok thanks
Something is strange: your method does not estimate the magnitude of the translation (only its direction), and yet somehow it is pretty close to the ground truth.
Nope, the translation does include the magnitude. The whole transformations of the camera poses are estimated
@NicolaiAI How can I get the dataset?