Quick and painless. Also like your vibe man. You got yourself a subscriber
@@LeFinesseGod glad to have you
thanks bro you are my god
@@gostjoke I'm not a god. I'm just a dude 😎
@@cognibuild Actually I want to ask a question. After I converted my safetensors model to GGUF, I tried some questions, but the model's answers seem a lot worse than when it was in safetensors. Do you know why?
@@gostjoke A lot of times it has to do with the parameters. Check the ChatML mode.
Also try using KoboldCPP (kcpp) for your GGUF files to see if they work.
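For example, with llama-cpp-python you can force the chat template explicitly when you load the GGUF (a rough sketch; the model path and prompt are just placeholders):

from llama_cpp import Llama

# rough sketch: load the GGUF with an explicit chat format so prompts are wrapped
# the same way as during training ("model.gguf" is a placeholder path)
llm = Llama(model_path="model.gguf", chat_format="chatml", n_ctx=4096)

out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Say hello in one sentence."}]
)
print(out["choices"][0]["message"]["content"])

If the answers improve with the right template, the conversion itself was probably fine.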
@@cognibuild got it, thanks
Hey - before I ask a question, I found your videos on YouTube last week - outstanding content - thank you so much, massive shout out. You helped me get some local AI up and running for work and I just got major shout-outs at work - thank you so much!
How do we convert an already-quantized version of a Llama 8B to a .gguf file? I keep getting a tensor issue!
I found this on GitHub, which might help. You'll need to install and use it in Linux or WSL:
github.com/kevkid/gguf_gui
Just be certain to run: pip install llama-cpp-python
because it's not included in the installation.
Let me know if it works for you!
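If the tensor error is because the checkpoint is still quantized (e.g. bitsandbytes 4-bit), note that llama.cpp's convert script expects an fp16/bf16 safetensors folder, so you would merge/dequantize back to fp16 first and then convert. A rough sketch of just the convert call, assuming llama.cpp is cloned locally (paths are placeholders and the script name/flags can differ between llama.cpp versions):

import subprocess

# rough sketch: convert an fp16 Hugging Face checkpoint folder to GGUF with
# llama.cpp's converter; both paths below are placeholders
subprocess.run(
    [
        "python", "llama.cpp/convert_hf_to_gguf.py",
        "path/to/merged-fp16-model",          # folder with config.json + safetensors
        "--outfile", "path/to/llama-8b-f16.gguf",
        "--outtype", "f16",
    ],
    check=True,
)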
Excellent, bro
Yes, this is normal stuff.. but you may not realize that you can open a GGUF with the transformers library!!
Hence you can use save_pretrained to unquantize the GGUF file back to safetensors!!
How would you unquantize something? The numbers are lost
@@cognibuild They are not, my friend. I thought that too, but the numbers are not lost; the model is just in permanent 4-bit mode.
So:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# model_id is the repo/folder and filename is the .gguf file inside it
tokenizer = AutoTokenizer.from_pretrained(model_id, gguf_file=filename)
model = AutoModelForCausalLM.from_pretrained(model_id, gguf_file=filename)
print('Extract and Convert to FP16...')
model = model.to(torch.float16)  # cast the dequantized weights to fp16
This way transformers loads the model as normal (dequantizing the 4-bit weights), and then you can save it like normal:
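Roughly, the save step after that would look like this (the output folder name is just a placeholder):

# rough sketch: write the dequantized model back out as safetensors
model.save_pretrained("llama-from-gguf-fp16", safe_serialization=True)
tokenizer.save_pretrained("llama-from-gguf-fp16")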
I was searching for this in the beginning, but when I could not find it I gave up.
But I fed my model all of the Hugging Face docs,
so when I was talking about GGUF it told me it was possible, and then I found it in the docs.
Really, I could not believe it, so I tried it!!! Technically GGUF is just another form of zip,
but for tensors: it converts the model into a llama-architecture clone, but it remains Mistral inside. Technically it's only a wrapper for a Q4 etc. Yes, the tensor sizes are changed, but the calculation to compress is mirrored to decompress... so it can unzip again?
When you compress the model, i.e. a 7B, it turns into something like a 3.5B in size, so did it shrink? But Unsloth uses 4-bit models, so we use quantized LoRAs...
so there should be no problem once the model is loaded!
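If you want to sanity-check what actually happens, you can print a few parameter dtypes after loading; transformers dequantizes the GGUF weights back to a float dtype on load, but the values stay at the quantized levels (a rough sketch, reusing the model loaded above):

# rough sketch: inspect a few weights of the GGUF loaded via transformers;
# dtypes are float, but the values are the dequantized Q4 values, not the original fp16
for name, param in list(model.named_parameters())[:3]:
    print(name, param.dtype, param.shape)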
Right, transferring it back to the format makes more sense. Because if you cut off decimals, those decimals are gone. Which is why you're saying it stays at the quantized precision but can now run as safetensors. Cool man!
@@cognibuild I actually discovered it today, bro! So I thought I would share...
GGUF locks it for transport... so you can unlock it... but as you say, I think there will be some loss on Q4 and the harsher quantizes. I always train in 4-bit to make sure that when I quantize the model afterwards, it's basically the same as it was in training.
But if I were to use it for transporting, I would probably do a Q8 or even an fp16 GGUF...
just to make sure.... (This is something quite hidden: you know it can be done, but not the syntax.) As you choose the folder location or repo location, you also need to specify the filename.... (wow)... or you can even just specify the full path of the filename with the kwarg... (wow).... (see the snippet at the end of this comment)
(But it's still better to run them with llama.cpp for its speed (on a laptop or PC transformers runs a bit slow), while llama.cpp runs fast.. so on a laptop, if you have to use the raw weights, use pipelines, as it's also much faster for some reason!)
(Today I actually conquered Stable Audio (local))... with minor adjustments to their code to recompile it for local use instead of pulling from the repo (quite easy in the end).. Now I can do sound generation... I'm still using BLIP-1 for image captioning etc. (to learn the craft)... For me, I have been concentrating on getting media IN first; all outputs lead to text, but now sound also (speech and noises)... (Really enjoyable stuff, bro... perhaps you should do a few tutorials...)....
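For reference, pointing at a repo plus a filename looks roughly like this (the repo id and .gguf filename are placeholders, not a real upload):

from transformers import AutoModelForCausalLM, AutoTokenizer

# rough sketch: load a GGUF straight from a Hub repo by giving the repo id plus
# the .gguf filename; both names below are placeholders
repo_id = "your-name/your-model-gguf"
gguf_name = "model-q8_0.gguf"

tokenizer = AutoTokenizer.from_pretrained(repo_id, gguf_file=gguf_name)
model = AutoModelForCausalLM.from_pretrained(repo_id, gguf_file=gguf_name)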
What if I don't have llama.cpp and I wanna run my model in Jan?
in jan?
What is your PC setup, bro? Share it.
@@cikokid ASUS ProArt X670E motherboard, Ryzen 9 7950X, 128 GB DDR5, Nvidia 4090
I always wonder why the hell people do coding tutorials in a video?
@@alwekalanet885 I didn't know... ask everyone else who appreciates it