• Bloefz@lemmy.world · +22/−3 · 13 days ago

    I work with AI and use it personally, but I run local models on my own servers, which solves a lot of privacy concerns. Inaccuracy is another problem, but not a big one for me: I know about it and simply fact-check. I don’t really use it for knowledge anyway, just to filter news to my interests, help with summaries and translation, etc.

    People use AI as some all-knowing oracle, but an LLM is not meant for that at all.

    • Ex Nummis@lemmy.world · +8 · 13 days ago

      This is the correct way to use it: in a field you are already very knowledgeable in, so you can do your own fact-checking. This is absolutely paramount. But most people are content to just copy-paste and don’t even ask the LLM for sources.

      • Bloefz@lemmy.world · +4 · 13 days ago

        I have one server with a cheap AMD Instinct MI50; those go for really cheap on eBay and have really good memory bandwidth thanks to HBM2. They worked fine with Ollama until it recently dropped support for some weird reason, but a lot of other software still works fine, and older models still run fine on the older Ollama.

        The other server runs an RTX 3060 12 GB. I use this for models that only work on Nvidia, like Whisper speech recognition.

        I tend to use the same models for everything so I don’t get the delay of loading a model each time. Mainly uncensored ones, so it doesn’t choke when someone says something slightly sexual; I’m in some very open communities, and standard models are pretty useless there with all their prudishness.

        For a frontend I use Open WebUI, and I also run scripts directly against the models.
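A script talking to a local model can be sketched like this (the endpoint and model name are placeholders for whatever you run; Ollama and the llama.cpp server both expose an OpenAI-compatible chat completions route like this one):

```python
import json
import urllib.request

# Placeholders: adjust to your own local server and model.
BASE_URL = "http://localhost:11434/v1/chat/completions"
MODEL = "llama3.1:8b"

def build_request(prompt: str, model: str = MODEL) -> dict:
    """Build an OpenAI-style chat completion payload."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0.2,
    }

def ask(prompt: str) -> str:
    """POST the prompt to the local server and return the reply text."""
    data = json.dumps(build_request(prompt)).encode()
    req = urllib.request.Request(
        BASE_URL, data=data, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]

# reply = ask("Summarize this article: ...")
```

Because the API is OpenAI-compatible, the same script works unchanged against most of the backends discussed below.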

          • Bloefz@lemmy.world · +2 · 13 days ago

            Agreed. The way they just dumped support for my card in an update, with only a vague reason, also irked me (“we need a newer ROCm”, they said, but my card works fine with all current ROCm versions).

            Also, now that they’re trying to sell cloud AI, their original local service is in competition with the product they sell.

            I’m looking to switch to something new, but I don’t know what yet.

            • brucethemoose@lemmy.world · +2 · 13 days ago

              I’ll save you the searching!

              For max speed when making parallel calls, vLLM: https://hub.docker.com/r/btbtyler09/vllm-rocm-gcn5

              Generally, the built-in llama.cpp server is the best for GGUF models! It has a great built-in web UI as well.

              For a more one-click, RP-focused UI and API server, the ROCm build of kobold.cpp is sublime: https://github.com/YellowRoseCx/koboldcpp-rocm/

              If you are running big MoE models that need some CPU offloading, check out ik_llama.cpp. It’s specifically optimized for MoE hybrid inference; the caveat is that its Vulkan backend isn’t well tested, but they will fix issues if you find any: https://github.com/ikawrakow/ik_llama.cpp/

              mlc-llm also has a Vulkan runtime, but it’s one of the more… exotic LLM backends out there. I’d try the other ones first.
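A launch sketch for the llama.cpp server option (the model path is a placeholder; `-ngl 99` offloads all layers to the GPU, and the web UI and OpenAI-compatible API share the chosen port):

```python
import subprocess

# Sketch: start llama.cpp's built-in server for a GGUF model.
# The model path is a placeholder for whatever you actually downloaded.
cmd = [
    "llama-server",
    "-m", "/models/llama-3.1-8b-instruct.Q4_K_M.gguf",  # placeholder path
    "-ngl", "99",           # offload every layer to the GPU
    "--host", "127.0.0.1",
    "--port", "8080",       # web UI and API both served here
]

def start_server() -> subprocess.Popen:
    """Launch llama-server as a non-blocking child process."""
    return subprocess.Popen(cmd)

# proc = start_server()  # then open http://127.0.0.1:8080 for the web UI
```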

              • Bloefz@lemmy.world · +1 · 13 days ago

                Thank you so much!! I’ve been putting it off because what I have works, but the time will soon come when I’ll want to test new models.

                I’m looking for a server, but not many parallel calls, because I want as much context as I can get: when making space for e.g. 4 parallel slots, the context is split and thus 4× smaller per slot. With Llama 3.1 8B I managed to get a 47,104-token context on the 16 GB card (though actually using that much is pretty slow), and that’s with the KV cache quantized to 8-bit too. But sometimes I just need that much.
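That 47K figure squares with rough KV-cache arithmetic (a sketch; the numbers are Llama 3.1 8B’s published architecture: 32 layers, 8 grouped-query KV heads, head dimension 128, and 1 byte per element for an 8-bit cache; actual allocation varies with backend overhead):

```python
# Back-of-envelope KV-cache sizing for Llama 3.1 8B with an 8-bit KV cache.
layers = 32         # transformer layers
kv_heads = 8        # grouped-query-attention KV heads
head_dim = 128      # dimension per head
bytes_per_elem = 1  # 8-bit quantized cache entries

# Keys and values (hence the factor of 2), per token, across all layers.
bytes_per_token = 2 * layers * kv_heads * head_dim * bytes_per_elem

ctx = 47104  # the context size reported above
kv_cache_gib = bytes_per_token * ctx / 2**30

print(bytes_per_token)  # 65536, i.e. 64 KiB per token
print(kv_cache_gib)     # 2.875 GiB for the full context
```

So the cache alone eats roughly 2.9 GiB of the 16 GB card, with the quantized weights taking most of the rest.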

                I’ve never tried the llama.cpp server directly, thanks for the tip!

                Kobold sounds good too, but I have some scripts talking to the model directly, so I’ll read up on whether it can handle that. I don’t have time now, but I’ll get to it in the coming days. Thank you!

                • brucethemoose@lemmy.world · +1 · 13 days ago

                  vLLM is a bit better at parallelization. All the KV cache sits in a single “pool”, and it uses as many slots as will fit: if it gets a bunch of short requests, it runs many in parallel; if it gets a long-context request, it kind of just does that one.

                  You still have to specify a maximum context, though, and it’s best to set that as low as possible.

                  …The catch is that it’s quite VRAM-inefficient. But it can split a model over multiple cards reasonably well, better than llama.cpp can, depending on your PCIe speeds.
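The parallel-call pattern can be sketched like this (endpoint and model name are placeholders; vLLM batches concurrent requests into its shared KV pool server-side, so the client just needs to issue them concurrently):

```python
import concurrent.futures
import json
import urllib.request

URL = "http://localhost:8000/v1/completions"  # placeholder vLLM endpoint
MODEL = "llama-3.1-8b"                        # placeholder served model name

def make_payload(prompt: str) -> bytes:
    """Serialize one OpenAI-style completion request."""
    return json.dumps(
        {"model": MODEL, "prompt": prompt, "max_tokens": 64}
    ).encode()

def complete(prompt: str) -> str:
    """Send a single request; the server decides how to batch it."""
    req = urllib.request.Request(
        URL, data=make_payload(prompt),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["text"]

def complete_many(prompts: list[str]) -> list[str]:
    """Issue requests concurrently so vLLM can batch them together."""
    with concurrent.futures.ThreadPoolExecutor(max_workers=8) as pool:
        return list(pool.map(complete, prompts))
```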

                  You might try TabbyAPI with EXL2 quants as well. It’s very good with parallel calls, though I’m not sure how well it supports MI50s.

                  Another thing to tweak is batch size. If you are actually making a bunch of 47K-context calls, you can increase the prompt-processing batch size a lot to load the MI50 better and get it through the prompt faster.

                  EDIT: Also, now that I think about it, I’m pretty sure Ollama is really dumb about parallelization. Does it even support paged-attention batching?

                  The llama.cpp server should be much better, e.g. it uses less VRAM for each of the “slots” it can utilize.

        • brucethemoose@lemmy.world · +2 · edited · 13 days ago

        Bloefz has a great setup; used MI50s are cheap.

        An RTX 3090 plus a cheap HEDT/server CPU is another popular homelab config. Newer models run reasonably quickly on it, with the attention/dense layers on the GPU and the sparse parts on the CPU.

    • Clanket@lemmy.world · +4 · 13 days ago

      How do you know it’s doing any of this correctly, especially filtering and translations?

      • Bloefz@lemmy.world · +3 · 13 days ago

        I mainly use it for Spanish, which I have basic proficiency in; it just accompanies me on my learning journey. It may be wrong sometimes, but not often. Like the other reply said, LLMs are good at languages: that’s what they were originally designed for, until people found out they could do more (though not quite as well).

        And as for filtering, I just use it as a news-feed sanitizer with a whole bunch of rules. It will miss things sometimes, but my ruleset isn’t perfect either. I often come across the unfiltered sources anyway, and even if it misses something, it’s only news, nothing really important to me.
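A news filter like that can be sketched as a classification prompt plus a strict parse (the rules here are hypothetical examples, and the model call itself is left out; `ask_local_model` is a placeholder for whatever client you use):

```python
# Sketch of an LLM news filter: build a yes/no classification prompt
# from a ruleset, then parse the model's one-word answer strictly.

RULES = [  # hypothetical ruleset
    "Skip celebrity gossip.",
    "Keep anything about local politics.",
    "Keep major open-source project releases.",
]

def build_filter_prompt(headline: str, rules: list[str]) -> str:
    """Compose a prompt asking the model to apply the rules to one headline."""
    rule_text = "\n".join(f"- {r}" for r in rules)
    return (
        "Apply these rules to the headline below and answer only KEEP or SKIP.\n"
        f"Rules:\n{rule_text}\n"
        f"Headline: {headline}\n"
        "Answer:"
    )

def parse_verdict(reply: str) -> bool:
    """Return False only on an unambiguous SKIP; default to keeping the
    item when the answer is unclear, since a missed filter is cheap."""
    words = reply.strip().split()
    return not (words and words[0].upper() == "SKIP")

# keep = parse_verdict(ask_local_model(build_filter_prompt(h, RULES)))
```

Defaulting to “keep” on an unclear answer matches the point above: a miss only means seeing one extra news item.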

      • MagicShel@lemmy.zip · +2/−1 · 13 days ago

        Not OP, but…

        It’s not always perfect, but it’s good for getting a tl;dr to see if something might be worth reading further. As for translations, that’s something AI is rather decent at. And if I go from understanding 0% to 95%, really only missing some cultural context about why a certain phrase might mean something different from its face value, that’s a win.

        You can do a lot with AI where the cost of it not being exactly right is essentially zero. Plus, it’s not like humans have a great track record for accuracy, come to think of it. It comes down to being as skeptical of it as you would be of any other source.

  • plz1@lemmy.world · +13 · 13 days ago

    This has a strong whiff of the former Facebook engineers who forbade their families from using the platforms they built.