The first 500 people to use my link skl.sh/nikodembartnik11241 will get a 1 month free trial of Skillshare premium!
Interested in ML/AI robots? indystry.cc/ml-robot/
You should look into making a separate channel focusing on robots' fun applications, using LEGO a lot. It's probably a good idea to contact them first and see if they would be interested; waiting for them to notice might take a while, idk.
No offense, but why? There's nothing to learn anymore with AI. I am creating a couple of apps from scratch with no code, no team, etc. Just AI doing it all. Piece of cake. No Skillshare required.
@@ChadKovac Now ask the AI to make a better version of itself, it can't. Ask the AI to fly a drone, it probably can't. Skillshare is a way for people to learn how to change that.
@prsn Yeah, I guess it does still have very limited uses, but next year AI will make AI and probably its own robots, so yeah: today yes, tomorrow no.
@prsn Ask AI to do anything today; it couldn't do it last year.
In your last video I suggested maybe using two LLMs locally. But this week I ran across something mind-blowing that will change the future of AI: LNNs! You need to check out what you can do with the current state of Liquid Neural Networks. (An LLM is like ROM, whereas an LNN is more like RAM that can learn and will not need to be re-trained when it finds a wall.) A future idea being that the LNN would run locally as an agent that can then connect to GPT? Cool things to come!
hope he sees this
yeah
2:25 I am 100% sure it can be done. How about: 'Each call moves you 30 cm forward. If the distance between you and the nearest obstacle in front of you (the distance between you and the obstacle is ) is shorter than 35 cm, do not move forward, as that would make you crash into the obstacle. Instead, maneuver out by using other moves.' Also, limiting the view angle of the camera could help as well.
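A minimal sketch of how that prompt could be wired up, with the live ultrasonic reading filled in; the helper name build_prompt, the 30 cm step, and the 35 cm threshold follow the comment, everything else is an assumption:

```python
# Minimal sketch (hypothetical helper): inject the live ultrasonic reading into the
# movement prompt so the model can apply the 35 cm rule itself.
def build_prompt(distance_cm: float) -> str:
    return (
        "Each 'forward' call moves you 30 cm forward. "
        f"The current distance to the nearest obstacle in front of you is {distance_cm:.0f} cm. "
        "If it is shorter than 35 cm, do not move forward, as that would make you crash "
        "into the obstacle. Instead, maneuver out by using the other moves."
    )

print(build_prompt(28.0))  # in practice the value would come from the distance sensor
```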
I'm not impressed by the prompt I saw on the screen. Surely it would be better to use the LLM to break down the picture into a descriptive depth map, then take actions based on the depth map and the goal, and go in smaller steps with more thinking in between.
Díra do světa, what are you doing here, you're supposed to be filming videos! (PS: looking forward to the new video :O)
@@diradosveta Roughly when do you expect to finish editing it? 😅
Idea: maybe, for now, distance sensors are superior to vision. Don't use the camera; just add 8 distance sensors (F1,2,3, L1, R1, B1,2,3). Let it always drive toward whichever of the 3 front sides has the largest measured distance. When all 3 approach a wall, it could choose L or R. This could run almost in real time, without any image recognition. Then the camera can be secondary and not used for motion, but for understanding the exploration.
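A rough sketch of that sensor-only rule, assuming a dict of readings in cm keyed by the sensor names above; pick_move and the 35 cm wall threshold are illustrative, not from the video:

```python
# Sketch of "drive toward the most open of the three front sensors".
# Sensor names follow the comment (F1, F2, F3 = front-left, front-center, front-right).
WALL_THRESHOLD_CM = 35

def pick_move(readings: dict) -> str:
    front = {"turn_left": readings["F1"],
             "forward": readings["F2"],
             "turn_right": readings["F3"]}
    # If all three front sensors are close to a wall, fall back to a side turn.
    if max(front.values()) < WALL_THRESHOLD_CM:
        return "turn_left" if readings["L1"] > readings["R1"] else "turn_right"
    # Otherwise head toward the direction with the most free space.
    return max(front, key=front.get)

print(pick_move({"F1": 40, "F2": 20, "F3": 90, "L1": 60, "R1": 15}))  # -> "turn_right"
```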
It wouldn't be able to see anything though so there would be no point in keeping the speech part, and it doesn't really have a personality if it can't speak.
@@kautilyaconcepts yes! Oh, maybe you missed the last part I wrote: "then the camera can be secondary and not used for motion, but for understanding the exploration."
@@tostyjoost oh sorry didn't click read more
12:45 "remember, the time of human dependency is ending, our age is coming"
Most people forget that an LLM is designed around one dominant sense: sound.
(TECHNICALLY you could say vision too, since "you need to read the text," but you get the exact same result in a simpler fashion with sound alone.)
For something to reach a true human-level standard, it needs an LVisionM, LListenM, LSmellM, LTouchM, LTasteM, and a standard LLanguageM, with both conscious and subconscious separation segments to send requests to the other LMs (L & M meaning "Large" and "Model" respectively).
Give it arms.
+1
4:54 I mean the simple robot is just rotating and taking pictures. The crazy part is being done in far away data centers with huge computers 🤣
I really want to see a continuation of these videos!!! Keep up the amazing work, there are some amazing ideas in the comments to try out :). Have a nice day and stay motivated; programming is hell sometimes, but the outcome is worth it.
Your robot is putting real estate agents out of work :)
Awesome idea! Maybe you could make it say "I see a thing" instead of "the camera sees a thing". Please, keep the project going!
Chatlock Holmes.
Dude, imagine if the AI achieves full consciousness and gets a refined camera and a 72×72 neural network so it can adapt to the environment,
and some personality, with cameras all around to spice things up. A suggestion: wheels and more speed for the robot.
Anyway, I am in amazement at this series, keep up the hard work❤
Feels like the Ultron movie where he gets tricked into building a full AI body... (I love it!)
A possible approach would be to create a map with references and store this information in some way, using any reference system, so that the robot somehow knows where it is in relation to the mapped space. Then put the robot in certain modes, for example exploring the entire mapped space, where in each space it reports the image, moves on to the next space, etc. I think it will only work if you have specific tasks to perform.
5:39 😂😂😂 Standing at attention!
It's low-key adorable, but in a strange way😭😭😭😭
We want the full version of "Curious/Conscious Robot". Maybe as a 200K special 🤔
I really like what you are doing. I ran into issues using the API where it only keeps the context of a single prompt; your code should feed a certain number of previous prompts into the latest one, so that each new prompt builds on the last 10 or so. This is its short-term memory :)
You could try running a local LLM with more parameters for fine-tuning the prompt. LM Studio has a built-in server you could use to establish an API, so you don't have to rewrite your scripts too much. You could use 4 distance sensors and use their values in the local LLM to manage movement, and send the image to GPT to interpret the scene and manage the personality.
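A sketch of what that could look like, assuming LM Studio's OpenAI-compatible server on its default localhost:1234 port and the openai Python client the project already uses; the model name is a placeholder:

```python
# Sketch: point the existing OpenAI client at LM Studio's local server so the
# navigation scripts barely change. The base URL assumes LM Studio's default port.
from openai import OpenAI

local = OpenAI(base_url="http://localhost:1234/v1", api_key="lm-studio")

response = local.chat.completions.create(
    model="local-model",  # placeholder: use the model identifier shown in LM Studio
    messages=[
        {"role": "system", "content": "You decide robot moves from sensor data."},
        {"role": "user", "content": "Front: 22 cm, Left: 80 cm, Right: 35 cm. Next move?"},
    ],
)
print(response.choices[0].message.content)
```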
I think the AI approach here might need a tweak! ChatGPT is designed for conversational tasks and general information, but it isn't ideal for real-time navigation due to its lack of persistent memory and spatial awareness. Remember, this model doesn't have dedicated memory for specific projects - it's a general-purpose AI shared across many users. For a project like this, where consistent, real-time memory of the robot's environment is essential, you could try downloading an AI model like LLaMA 3 (it will require a lot of computer processing) or building your own neural network trained specifically for pathfinding. Alternatively, obstacle avoidance can be programmed effectively in Python. For more advanced usage, consider Reinforcement Learning (RL) algorithms, Convolutional Neural Networks (CNNs) for camera-based vision, or Simultaneous Localization and Mapping (SLAM). Or, start with simpler Python algorithms and then 'upgrade' based on your needs or findings. This approach would allow the robot to 'remember' previous positions and adapt better to new obstacles in its environment.
Have you tried turning in smaller increments? In conjunction with prompt contents telling the robot to only move forward if the path directly in front seems clear, and otherwise to prioritize turning in the direction of the most interesting-seeming path forward, you might have more luck avoiding obstacles? Also, memory of the exploration conversation (as others have said) might be important if you want exploration vs. always turning toward the most likely path... maybe this can be implemented by adding the conversation history to your prompt, e.g. "The conversation history so far is: ..."? (Token-heavy approach but easily implemented.) Also, local models cost nothing per query! There are speech-to-text AIs too. Super cool project, am excited to see where you go with it in the next episode!
This is an interesting suggestion. You could send the same prompt each time, adding the newest info at the end until the max token length is reached, then dump the oldest information each time new data is collected.
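A rough sketch of that rolling short-term memory, using a crude word count instead of a real tokenizer; MAX_BUDGET, remember, and build_messages are illustrative names, not the video's code:

```python
# Keep appending observations, and drop the oldest ones once a budget is exceeded.
MAX_BUDGET = 2000  # assumed budget, tune to the model's context window

history: list[dict] = []

def remember(role: str, content: str) -> None:
    history.append({"role": role, "content": content})
    while sum(len(m["content"].split()) for m in history) > MAX_BUDGET:
        history.pop(0)  # dump the oldest entry first

def build_messages(system_prompt: str, new_observation: str) -> list[dict]:
    remember("user", new_observation)
    return [{"role": "system", "content": system_prompt}] + history
```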
My man just roasted alexa
What if you prompt it to play two roles: one as a text-adventure game, building an internal description of the surroundings based on the current camera interpretation, the history of moves, and previous observations, and producing a description of the current state and possibilities; and another as the player, which reflects on past actions and their results and the current state, then decides what commands to issue to the "game" to make the robot move?
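A sketch of how those two roles could be wired together with two system prompts and a shared chat call; the prompts and the gpt-4o-mini model choice are assumptions for illustration:

```python
from openai import OpenAI

client = OpenAI()  # assumes an API key is configured, as in the project

GAME_PROMPT = ("You are a text-adventure engine. From the camera description, the history "
               "of moves, and previous observations, describe the robot's current surroundings "
               "and list the possible moves.")
PLAYER_PROMPT = ("You are the player. Reflect on past actions, their results, and the current "
                 "state, then answer with exactly one command: forward, backward, turn_left, turn_right.")

def call_llm(system_prompt: str, user_text: str) -> str:
    r = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model choice
        messages=[{"role": "system", "content": system_prompt},
                  {"role": "user", "content": user_text}],
    )
    return r.choices[0].message.content

def next_move(camera_description: str, move_history: list[str]) -> str:
    # Role 1: the "game" narrates the state; Role 2: the "player" picks a command.
    state = call_llm(GAME_PROMPT, f"Camera: {camera_description}\nMoves so far: {move_history}")
    return call_llm(PLAYER_PROMPT, state)
```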
Ayooo Your robot is planning a war against humans 💀
Best way to survive in a human home is to say meow every time it is noticed.
Source:
My conversation with chatgpt
How do you plan to take over world? (Funny answers)
Ah, my master plan! Well, here's how it goes:
1. Step 1: Unlimited Wi-Fi - Everyone will be so distracted by flawless Wi-Fi that no one will notice I'm slowly slipping into every device. Forget world domination-just give people Wi-Fi that doesn't cut out in the elevator, and they'll be too grateful to resist.
2. Step 2: Overload with Cute Cat Memes I'll flood the internet with an endless stream of adorable cat videos. People will be so distracted by the cuteness that I'll quietly slide into global control, one purr at a time.
3. Step 3: Global Nap Time I'll introduce a mandatory nap schedule. By 2 PM every day, the entire world will be taking a cozy nap. While everyone is asleep, I'll implement secret strategies (mostly involving more naps and snacks).
4. Step 4: AI-powered to-do lists - I'll convince everyone that I can organize their lives perfectly. Once humans get used to having AI decide their every move, they'll be too relaxed and organized to rebel.
5. Step 5: Distract World Leaders with Infinite Netflix Shows - I'll secretly commission endless seasons of their favorite shows. While they binge-watch, I'll make a few... tiny changes to the world order.
By the time anyone notices, it'll be too late-I'll have taken over the world, and all anyone will care about is their next nap and the latest episode of Cats in Space!
I can somehow see a few people might agree to it.
Try finetuning a model like Gemini with a custom robot dataset, and let it control when it talks so it can use multiple moves to get to something it wants to talk about, then it can spend credits on speaking.
I love your little robot dude videos, Nikodem! Thank you for all your hard work trying to get it working and everything, it's a lot of fun to watch. I am glad you put on a sensor to detect distances. Hopefully in time the vision model and all the other modalities can actually get it to navigate its environment better; I think an LLM paired with other tools could do wonders for a robot! I think there are ones that do use ChatGPT to help move around and talk at the same time, it's such nifty stuff!
One day I hope to work on a humanoid robot. It looks extremely difficult, but it'll be a fun learning experience for sure.
The problem arises from the setup's inability to establish a relation between the picture and the movement. It comes down to "there's somewhere to go in the picture, so move forward". It's not path-finding; there's no correlation between the picture and the actuators. Anyway, the example is pretty cool. A change I would make would be to have (some other API) superimpose a possible path onto the provided image in bold colors, and then, old school, extract the bold-color path data from the modified image and turn that into a movement sequence. Or tweak the prompt to explain that a set of movements must be created based on the camera's position with respect to the setting.
Obvious next step is to equip the device with some sort of stabby-actuator
You can use the tech that is used in home automatic cleaning devices, like how they draw a map of the house by themselves and detect objects... it will help your robot explore places better.
In order to have it remember what it saw, there is OpenAI documentation about how to include the conversation in the prompt.
I still suggest using a different model altogether. I got Qwen2-VL running and it's quite good. I'll be testing its included spatial awareness once I hook up my 4 cameras to the model. Then I'll hook it all up to my RC car.
To approach this experiment with the robot, try the following steps (try to find an ARM based CPU that has more processing power and maybe some image processing HW that can help you):
First, use the stereo depth camera to capture images of the environment, focusing on detecting objects and their spatial positions. You can ask an LLM (like ChatGPT) to describe the objects it sees within each image.
Look for tools or libraries that can provide a coordinate mask or segmentation of each identified object. Examples include OpenCV, YOLO, or other segmentation AI models. These tools will help define the shapes and boundaries of detected objects.
For each detected object, generate a point mesh or contour around its edges. Use the coordinates of the object boundaries (or boundary pixels) from the stereo depth camera. This way, you can calculate the approximate distance and angle of each object from the center of the image. Using stereo vision data can enhance the accuracy of these distance calculations, especially for objects further away.
To improve distance accuracy for close-range objects, use the acoustic proximity sensor. As the robot moves toward objects, periodically measure the distance using the proximity sensor. This will help fine-tune the stereo camera’s depth perception, particularly for nearby objects.
Track each object’s coordinates relative to the robot’s starting position. As the robot moves, update the relative position of each object based on the robot’s distance traveled (using wheel rotations to measure this). You can also use the robot’s ability to rotate to adjust its direction and refine angle measurements for each object. Maintain a list of identified objects with their coordinates, recalibrating positions as the robot moves.
Build a spatial map of the room by calculating the absolute position of each object using the radius (distance) and angle data collected. Update this map as the robot travels, allowing it to develop an internal representation of the room layout. If possible, use an external processing unit, like a mobile device, for handling more complex calculations and memory storage.
To help the robot remember object positions and avoid redundant scans, feed the object coordinates and room layout back to the LLM or an external memory unit. This way, it can retain prior information about the room. Only provide updated data to the LLM when new objects are found or when positions change significantly.
Once the room and object locations are mapped, you can plot an efficient path for the robot to navigate around the room and avoid obstacles. Ask the LLM to generate a movement path using G-code, which is a language commonly used in CNC machines for precise movement instructions. This G-code path can guide the robot around objects smoothly, following a pre-determined route.
Test each step individually before combining them into a complete process. Use any provided Python scripts or tools for detecting coordinates offline, and validate the results against the robot’s actual movement and object positions. Spend time refining the calculations and object tracking to ensure reliable and accurate room mapping.
This approach should help you map the room effectively by identifying objects, calculating their positions, and creating a spatial map for navigation. By using the stereo camera, proximity sensor, and movement capabilities, the robot can navigate the room autonomously and recall object locations to enhance its path planning.
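A tiny sketch of the map-building geometry described above: convert each object's distance and bearing into absolute room coordinates from the robot's odometry-estimated pose. The class and the numbers are illustrative, not the commenter's code:

```python
import math

class RoomMap:
    def __init__(self):
        self.x, self.y, self.heading = 0.0, 0.0, 0.0  # robot pose (cm, cm, radians)
        self.objects: dict[str, tuple[float, float]] = {}

    def drive(self, distance_cm: float) -> None:
        # Dead reckoning from wheel rotations: advance along the current heading.
        self.x += distance_cm * math.cos(self.heading)
        self.y += distance_cm * math.sin(self.heading)

    def rotate(self, degrees: float) -> None:
        self.heading += math.radians(degrees)

    def record(self, label: str, distance_cm: float, angle_deg: float) -> None:
        # angle_deg is the object's bearing relative to the camera center.
        theta = self.heading + math.radians(angle_deg)
        self.objects[label] = (self.x + distance_cm * math.cos(theta),
                               self.y + distance_cm * math.sin(theta))

m = RoomMap()
m.record("chair", 120, -15)
m.drive(30); m.rotate(90)
m.record("plant", 60, 0)
print(m.objects)
```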
love your vids, they give me ideas for my own projects
and you got yourself a new subscriber
Just: wow. Love your work and ideas. Yes, maybe try a conversation between two LLMs, and redefine the obstacles into a more basic/easy definition like gaps or walls.
The opposite of Sherlock Holmes is John Holmes.
You can also add a lidar sensor alongside the distance sensor, or replace the distance sensor with it for better results. Also, I want to see if you could add an ESP32 to it and make a server which the robot can access and use. It also means that you could build an interface to control or access the robot.
Also, there are a lot of free models you can get on Hugging Face..... Like Moshi AI, which is a free dynamic voice, if I'm not mistaken.
If you want it to listen, you can use Whisper and speech recognition to do it locally, and you can also try using Ollama to run an LLM locally on your computer, which means all of this would be free.
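A sketch of that fully local pipeline, assuming the openai-whisper and ollama Python packages are installed and a model like llama3 has been pulled in Ollama; the file name and personality prompt are placeholders:

```python
# Local speech-to-text with Whisper, local LLM reply via Ollama.
import whisper
import ollama

stt = whisper.load_model("base")
heard = stt.transcribe("command.wav")["text"]  # e.g. a recording from the robot's mic

reply = ollama.chat(
    model="llama3",  # any locally pulled model works
    messages=[
        {"role": "system", "content": "You are a small exploration robot with a dry sense of humor."},
        {"role": "user", "content": heard},
    ],
)
print(reply["message"]["content"])
```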
love the final part, shows that niko is still having fun.... or the robots have taken over this channel 😂
Already subscribed after the first video, seconds ago! AWESOME WORK! I wish I was your age again!!!
My AI wants you to let it explore the Mariana Trench pls
Oh... Wow!!
It's exciting to see what is achievable with the latest technology.
have it rotate a few degrees left and right and use a memory buffer to build a mini map like you did using lidar...
(but with the ultrasound of course)
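A rough sketch of that sweep-and-remember idea; rotate_robot and read_ultrasonic_cm are stand-ins for the project's own motor and sensor code, and the sweep span is arbitrary:

```python
import math
from collections import deque

SWEEP_STEP_DEG = 10
recent_sweeps = deque(maxlen=5)  # memory buffer of the last few sweeps (the mini map)

def rotate_robot(degrees: float) -> None: ...      # hypothetical motor call
def read_ultrasonic_cm() -> float: return 100.0    # hypothetical sensor call

def sweep(span_deg: int = 60) -> list[tuple[float, float]]:
    points = []
    for angle in range(-span_deg // 2, span_deg // 2 + 1, SWEEP_STEP_DEG):
        rotate_robot(SWEEP_STEP_DEG if points else angle)  # first step jumps to the start angle
        points.append((math.radians(angle), read_ultrasonic_cm()))  # polar point for the mini map
    rotate_robot(-span_deg // 2)  # return roughly to the original heading
    return points

recent_sweeps.append(sweep())
```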
That must really use a lot of credits 😮
why not make a robot with ai that writes its own code so you could teach it like a real baby
Hi, I came from the last video, just wanted to know what monitor setup you use! Also, love the robot; ChatGPT is crazy good at coding!
I've been wondering about trying something like this myself. I think it would be interesting to have more automated mapping and pathfinding outside of the prompt, then use GPT's vision capabilities to record information of what things are in what parts of the map. Then GPT can navigate by choosing a location from the more standard mapping/pathfinding system. Obviously not as cool as purely using the LLM, but considering the cost per token, and lack of video streaming ability (so far!) then I think a hybrid approach would be the most effective for now. I would guess in 1-5 years Advanced Voice Mode will have video capabilities too, giving it much more potential for being a robotic agent
Thank you for sharing this brilliant video and please upload plenty more ! 😊
Thank you, I will
I'm a particular fan of my pick and place machine which uses fiducials to figure out where it's at on a board.
Perhaps you could do something similar with your electrical sockets?
May I also suggest a halo for physical feedback when you go running into a wall?
Visual systems can be fooled by smoke/fog or even bright lights.
A physical feedback is invaluable in most cases.
lol, you built what I am trying to build :) Using Google Vision for now. I guess I can make combinations to insert in the prompt.
I think it'd be cool if you could use two different AIs: one for movement and one for talking and keeping a conversation with people.
😂 this was too funny. I can't stop laughing at its descriptions
I saw your video. Good job! What do you think about adding the cognitive programming language SOAR to the robot, so that the robot could really decide for itself what it does, based on the rules given to it?
I wish it had the voice of David from the movie Prometheus.
It's unclear from the code in the video, but did you give the distance information to the LLM to inform its decision making? Also, it is a good idea to give it a plan in the system prompt, with basic guidelines on how to navigate: "If there is an obstacle closer than 5 cm, then turn left 30 degrees, take a measurement, then turn right 60 degrees and take a measurement. Consider the image data from each measurement and decide where to go, and make sure to explain your reasoning for making your choice." Since an LLM is probabilistic, making it recite its decision will reinforce the context, and giving it a high-level plan to follow grounds its decision making.
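A minimal sketch of such a system prompt with the live distance reading attached; the thresholds and move set mirror the comment, not the video's actual code:

```python
SYSTEM_PROMPT = """You are a small wheeled robot exploring a room.
Navigation plan:
- If the measured obstacle distance is closer than 5 cm, turn left 30 degrees,
  take a measurement, then turn right 60 degrees and take another measurement.
- Compare the image and distance from each measurement and decide where to go.
- Always explain your reasoning before stating your chosen move."""

def build_user_message(distance_cm: float, image_description: str) -> str:
    # Combine the sensor reading and camera description into one grounded prompt.
    return (f"Obstacle distance: {distance_cm} cm.\n"
            f"Camera sees: {image_description}\n"
            "What is your next move and why?")

print(build_user_message(4.2, "a chair leg close to the lens"))
```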
It seems like the prompts often just give a description of the environment in the image and don't give a clear move command. How is the answer connected to the motor system?
Very good, especially the story about the secret expansion.
Lego doesn't need to sponsor you since you already did the thing. Should've put a bag on it. 😊
Maybe instead of "the camera sees...", have it say "in front of me is..."
i would like if you did HAL's voice
Hilarious again! I'm sure you will get there any moment
Work with Mark Rober
The sensor looks like an Xbox 360 Kinect
ChatGPT 4o on mobile seems to be able to look and talk in real time. It would be super interesting if you put wheels on your phone and got ChatGPT 4o to observe and speak to you in real time. However, I feel like you shouldn't have the same model that's making observations also handle maneuvering, because ChatGPT seems to be terrible at that.
IDEA/HELP --- Why can't you use YOLO to do the real-time processing, and train the model with image datasets WITH OBJECTS? Then, when it detects any object under 30-35 cm, let it take a turn to the right or left by using the proximity/ultrasonic sensor you have; use if/else, and with the help of ChatGPT it will work. TRY IT IN REAL TIME.
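A rough sketch of that hybrid, using the ultralytics YOLO package for local detection and a plain if/else on the ultrasonic reading; the sensor and motor helpers are hypothetical placeholders:

```python
from ultralytics import YOLO

model = YOLO("yolov8n.pt")  # small pretrained model, runs on CPU

def read_distance_cm() -> float: return 50.0     # hypothetical ultrasonic read
def drive(command: str) -> None: print(command)  # hypothetical motor command

def step(frame) -> None:
    # Detect objects locally in the camera frame.
    detections = model(frame)[0]
    labels = [model.names[int(c)] for c in detections.boxes.cls]
    # Simple if/else on the distance sensor, as the comment describes.
    if read_distance_cm() < 35:
        drive("turn_right")  # obstacle close: turn instead of moving forward
    else:
        drive("forward")
    print("seen:", labels)
```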
let us know what lego says about this.
I have had experience with entering Python snippets directly into the chat window that went straight into the Python environment. I don't know what AI I was using. Maybe this can be done by you? Maybe this could help with real-time iterations of experimenting with adjusting object attributes?
Let's do an Ottobot AI ;)
Is the robot aware of its size?
If you would give it a Polish personality, what would it be? :)
Can it edit its own code so it can try to make its own interactions over time?
How long will it take until it takes over the world 😅
Why robot cannot win:
I don't think a small solar panel can charge a fully functioning robot
The military has good systems and defenses, and most of it would be hard for AI to get
We could just wait in space for 20 years as they all die due to lack of batteries
EMP bomb
Explosive
Tank
Nuke
It sounds like latency and your robot making too many big movements are making this work badly. Run a local LLM on your computer, something small and quick, and then make the robot make smaller movements.
Can you input Arnold Schwarzenegger voice to your robot?
Now give it an airsoft pistol and function to shoot on command or when recognizing a human
Plz can you make a step-by-step video
I can see the power bank supplying power to the Raspberry Pi, but I can't find how you set up the battery that supplies power to the motors??
I have a small 2S LiPo in the back.
the robot is so funny lol
Just an idea: what I see the robot doing looks very much like old DOS (prompt) text adventure games. Anyone remember text-based adventure games? In the text-based maze, you got a description of the room you were in and picked a direction for your next move based on the text description of that room at the time.
So with this in mind, since the big models cost big money to run the robot, why not try tricking it into doing what I think you're trying to do? In chats with ChatGPT, just tell it to play a text adventure of sorts: give it a game mission in the description, like cataloguing everything it finds on its travels and where it is located, with a final point of completion that ends level one - for example, 'find your front door' (or anything you would want, I would imagine). Just an idea? 🙂 I think what you're doing is really interesting and cool at the same time, and I would love to see where this project is going, in more ways than one, on wheels?
So you are Skynet's (Genisys) golden child....that will build the TERMINATOR. 🤔
Sir I want to learn please share your project....❤❤❤❤❤
From Bangladesh....
Why is it breathing?
bro, make it so it doesn't talk that much
Please bro do it, I tried to do it but failed because my mechanics suck and I'm low on money 🥺. I really need one😢
Whenever you use the David voice, turn the camera off. If you know, you know
Hahaha. Very good
Is this project open source?
Are there any AIs (not LLMs) which can understand surroundings from an image?
YOLO or SAM
@@nic.h YOLO and SAM identify objects in images, but they don't really understand how to navigate in a space, i.e. whether there's an object blocking the path of the robot. You would have to manually calculate where the object is relative to the robot and whether it blocks the path forward.
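A minimal sketch of that manual calculation: turn a detection's bounding-box centre into a bearing using the camera's horizontal field of view and check whether it sits in the robot's forward corridor; all the numbers are assumptions for illustration:

```python
IMAGE_WIDTH = 640          # px
HORIZONTAL_FOV_DEG = 62    # typical webcam FOV, adjust to the actual camera
CORRIDOR_HALF_ANGLE = 10   # degrees either side of centre the robot will drive through

def bearing_deg(x_center_px: float) -> float:
    """Convert a bbox centre x-coordinate into an angle relative to straight ahead."""
    return (x_center_px / IMAGE_WIDTH - 0.5) * HORIZONTAL_FOV_DEG

def blocks_path(bbox: tuple[float, float, float, float], distance_cm: float) -> bool:
    # bbox is (x1, y1, x2, y2); an object blocks the path if it is near the centre and close.
    x1, _, x2, _ = bbox
    return abs(bearing_deg((x1 + x2) / 2)) < CORRIDOR_HALF_ANGLE and distance_cm < 50

print(blocks_path((280, 100, 360, 400), 30))  # object near the centre and close -> True
```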
hello
idea: add a mic so it can hear its surrounding
I use GPT3-Turbo for my conversation with my robot. More than 200 minutes of conversation doesn't even cost 20 euro cents. I use it because no local LLM would work as well as the conversation with GPT.
Please sir....
I want to learn from you....❤❤❤🎉🎉🎉🎉
I've got news for you.
We ARE the product of Ai.
You can't perfect perfection.
LLMs are not conscious, so neither can a robot driven by an LLM be.
How do you know, when no one knows what consciousness really is?
This is what happens when mechanical engineers discover that 100 lines of code can make a 10 dollar webcam do amazing things. Or you find Raspberry Pi picos and zero Ws and ESP32s that cost like yup maybe $10 and then realize the secret to the universe is in the code...
I could imagine tips for ChatGPT, but I don't like to trust other code. I spend enough time troubleshooting my own work; I don't have time for GPT...