• 5 Posts
  • 207 Comments
Joined 2 years ago
Cake day: June 28th, 2023




  • Thanks for the advice. I’ll see how much I can squeeze out of the new rig, especially with exl models and different frameworks.

    Gemma 12B is really popular now

    I was already eyeing it, but I remember its context being memory-greedy since it’s a multimodal model (rough numbers below), while Qwen3 was just way beyond the Steam Deck’s capabilities. Now it’s just a matter of assembling the rig and getting to tinkering.
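
    A quick back-of-the-envelope of why long context eats VRAM; the layer/head numbers below are assumptions for a ~12B-class model, not Gemma’s actual config:

    ```python
    # Rough KV-cache size estimate: why long context gets memory-greedy.
    # Layer/head/dim values are assumptions for a ~12B model, not an official config.
    def kv_cache_bytes(n_layers, n_kv_heads, head_dim, ctx_len, bytes_per_elem=2):
        # 2x because both keys and values are cached; fp16 = 2 bytes per element
        return 2 * n_layers * n_kv_heads * head_dim * ctx_len * bytes_per_elem

    for ctx in (4096, 16384, 32768):
        gib = kv_cache_bytes(n_layers=48, n_kv_heads=8, head_dim=256, ctx_len=ctx) / 1024**3
        print(f"{ctx:>6} tokens ~ {gib:.1f} GiB of KV cache")
    ```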

    Thanks again for your time and availability :-)




  • At the moment I’m essentially lab-ratting the models; I just love to see how far I can push them, both in parameters and in complexity of requests, before they break down. Plus it was a good excuse to expand my little “homelab” (read: workbench that’s also stuffed with old computers) from just a Raspberry Pi to something more beefy. As for more “practical” (still mostly to mess around) purposes, I was thinking about making a pseudo-realistic digital radio with an announcer, using a small model and a TTS model: that is, writing a small summary for the songs in my playlists (or maybe letting the model itself do it, if I manage to give it search capabilities), letting them shuffle, and using the LLM+TTS combo to fake an announcer introducing the songs; something along the lines of the sketch below. I’m quite sure there was already a similar project floating around on GitHub. Another option would be implementing it in Home Assistant, with something like Willow as a frontend, to have something closer to commercial assistants like Alexa but fully controlled by the user.
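
    Something like this minimal sketch is the idea; the endpoint, model name, and the piper invocation are placeholders/assumptions to swap for whatever backend (KoboldCpp, llama.cpp server, …) is actually running:

    ```python
    # Fake radio announcer sketch: ask a local LLM for a one-line intro for the
    # next song, then speak it with a TTS engine. The URL, model names, and the
    # piper voice file are assumptions, not tested values.
    import random
    import subprocess
    import requests

    SONGS = ["Song A - Artist 1", "Song B - Artist 2"]  # placeholder playlist
    LLM_URL = "http://localhost:5001/v1/chat/completions"  # OpenAI-compatible endpoint (assumed)

    def announce(song: str) -> str:
        prompt = f"You are a radio DJ. Write one short, upbeat sentence introducing the song: {song}"
        resp = requests.post(LLM_URL, json={
            "model": "local",  # most local backends accept any model name here
            "messages": [{"role": "user", "content": prompt}],
            "max_tokens": 60,
        }, timeout=120)
        return resp.json()["choices"][0]["message"]["content"].strip()

    def speak(text: str, wav_path: str = "intro.wav") -> None:
        # piper reads text on stdin; the voice model path is an assumption
        subprocess.run(["piper", "--model", "en_US-lessac-medium.onnx",
                        "--output_file", wav_path], input=text.encode(), check=True)

    song = random.choice(SONGS)
    speak(announce(song))
    print(f"Queued intro for: {song}")
    ```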

    I’ve been following this comm for a bit and there seems like a real committed, knowledgeable base of folks here - the dialog just in this post almost brings a tear to my eye, lol.

    To be honest, this post might have been the most positive interaction I’ve had on the web since the BBS days. I guess the fact that the communities are smaller makes it easier to gather people who are genuinely interested in sharing and learning about this stuff; same with the homelab community. Like comparing a local coffee shop to a Starbucks, it just by its nature filters for different people :-)


  • Right now I’m hopping between Nemo finetunes to see how they fare. I think I only ever used one 8B model from Llama 2; the rest has been all Llama 3 and maybe some Solar-based ones. Unfortunately I have yet to properly dig into the more technical side of LLMs due to time constraints.

    the process is vram light (albeit time intense)

    So long as it’s not interactive, I can always run it at night and make it shut off the rig when it’s done; power here is cheaper at night anyway :-)
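
    Something as simple as this would probably do it (the batch script name is a placeholder, and powering off needs the right permissions):

    ```python
    # Run the non-interactive job overnight, then power the rig off when it ends.
    import subprocess

    subprocess.run(["python", "overnight_batch.py"], check=False)  # hypothetical batch job
    subprocess.run(["sudo", "systemctl", "poweroff"])
    ```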

    Thanks for the info (and sorry for the late response; work + cramming for exams turned out to be more brutal than expected).



  • Did you say you’re using a x1 riser though? That splits it to a sixteenth of the bandwidth—maybe I’m misunderstanding what you mean by x1.

    Not exactly. What I mean by an x1 riser is one of these bad boys: they are basically extension cords for an x1 PCIe link, no bifurcation. The ThinkCentre has one x16 slot and two x1 slots. My idea for the whole setup was to put the 3060 I’m getting now into the x16 slot of the motherboard, so it can be used for other tasks as well if need be, while the second 3060 would go into one of the x1 slots via the riser; from what I managed to read, the narrower link should mainly affect how long it takes to first load the model (rough numbers below). But the fact you only mentioned the x16 slot does make me worry there’s some handicap to the other two x1 slots.
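
    For the load time, the back-of-the-envelope looks something like this (the usable-bandwidth figures are ballpark assumptions, not measurements):

    ```python
    # Rough load-time comparison for pushing a ~12 GB model into VRAM over PCIe.
    model_gb = 12.0  # roughly a 12 GB card filled with a quantized model

    links = {
        "PCIe 3.0 x16": 14.0,  # ~GB/s usable, assumption
        "PCIe 3.0 x1": 0.9,    # ~GB/s usable, assumption
    }

    for name, gbps in links.items():
        print(f"{name}: ~{model_gb / gbps:.0f} s to load the model")
    ```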

    Of course, the second card will come down the line; I don’t have nearly enough money for two cards and the ThinkCentre :-P

    started with my decade-old ThinkPad inferencing Llama 3.1 8B at about 1 TPS

    Pretty much the same story, but with the OptiPlex and the Steam Deck. Come to think of it, I do need to polish and share the scripts I wrote for the Steam Deck; since I designed them to be used without a dock, they’re a wonderful gateway drug to this hobby :-)

    there’s a popular way to squeeze performance through Mixture of Experts (MoE) models.

    Yeah, that’s a little out of scope for me; I’m more at home on the hardware side of things, mostly because I lack the hardware to really get into the more involved stuff. Though it’s not out of the question for the future :-)

    Tesla P100 16GB

    I am somewhat familiar with these bad boys; we have an older PowerEdge server full of them at work, where it’s used for fluid simulation (I’d love to see how it’s set up, but I can’t risk bricking the workhorse). Unfortunately, the need to figure out a cooling solution for those cards, plus the higher power draw, made them not really feasible on my budget.


  • Is bifurcation necessary because of how CUDA works, or because of bandwidth constraints? Mostly asking because the secondary card will be limited by the x1 link mining risers have (and also because, unfortunately, both machines lack that capability :'-) )

    Also, if I offload layers to the GPU manually, so that only the context needs to overflow into RAM, will that be less of a slowdown, or will it be comparable to letting model layers spill into RAM? (Sorry for the question bombing; I’m trying to understand how much I can realistically push the setup before I pull the trigger.)
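
    For context, the kind of knob I mean is n_gpu_layers in llama-cpp-python (KoboldCpp exposes the same idea as --gpulayers); the model path and layer count below are just placeholders:

    ```python
    # Manual offload sketch: keep N transformer layers in VRAM and let the rest,
    # plus any overflow, live in system RAM. Path and counts are placeholders.
    from llama_cpp import Llama

    llm = Llama(
        model_path="models/some-13b.Q4_K_M.gguf",  # hypothetical file
        n_gpu_layers=30,  # layers kept on the GPU; tune until it fits in VRAM
        n_ctx=8192,       # context length; the KV cache grows with this
    )

    print(llm("Q: What is a token? A:", max_tokens=32)["choices"][0]["text"])
    ```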


  • You need a 15$ electrical relay board that sends power from the motherboard to the second PSU or it won’t work.

    If you’re talking about something like the add2psu boards that pull the secondary power supply’s PS_ON line when the primary’s 12 V rail comes up, then I’m already on it the DIY way. Thanks for the heads-up though :-)

    expect 1-5token per second (really more like 2-3).

    5 tokens per second would be wonderful compared to what I’m getting right now, since it averages ~1.5 tok/s with 13B models (KoboldCpp through Vulkan on a Steam Deck). My main reasons for upgrading are bigger context/models, plus trying to speed up prompt processing, but I feel like the latter will also be handicapped by offloading to RAM.

    How much vram is the 3060 youre looking at?

    I’m looking at the 12 GB version. I’m also giving myself room to add a second one (most likely through an x1 mining riser) if I manage to save up enough for another card in the future, bumping the setup to 24 GB total by splitting models across both, though I doubt I’ll manage.
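
    If the second card ever materializes, my understanding is that splitting a model across both is a single knob in llama-cpp-python; a rough sketch (path and split ratio are placeholders to tune):

    ```python
    # Split a model across two GPUs; the x1-riser card may deserve a smaller share.
    from llama_cpp import Llama

    llm = Llama(
        model_path="models/some-20b.Q4_K_M.gguf",  # hypothetical larger model
        n_gpu_layers=-1,          # offload every layer
        tensor_split=[0.5, 0.5],  # fraction of the weights per GPU (assumed even split)
    )
    ```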

    Sorry for the wall of text, and thanks for the help.