p2p generative AI platform
There are tons of personal computers and homelabs out there with plenty of compute sitting idle. This project aims to create a marketplace connecting people with spare compute to people who need it. Note that this is not a financial marketplace -- it's intended to be a public good. Whether it takes off is anyone's guess, but in the meantime I'll donate whatever compute I can once this is up and running.
create a virtual environment:
python -m venv env
Depending on your environment, you may need to replace python in the above command with python3.
On Debian-based distributions, you may additionally need to install venv first, if it's not already installed:
apt install python3.12-venv
activate the virtual environment:
on Linux/macOS:
source env/bin/activate
on Windows:
.\env\Scripts\activate
If this command doesn't work (e.g. "Activate.ps1 cannot be loaded because running scripts is disabled on this system"), you may have to run the following command in an Administrator PowerShell session:
Set-ExecutionPolicy -ExecutionPolicy RemoteSigned -Scope CurrentUser
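To double-check that the virtual environment is actually active, you can run this optional one-off check (not part of the repo); it prints True inside a venv:

# Prints True when running inside a virtual environment, False otherwise.
import sys
print(sys.prefix != sys.base_prefix)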
You may be missing some CMake dependencies.
on Debian, you can run the following:
sudo apt-get install build-essential cmake
TODO: instructions for other common OSes
then, run:
pip install -r requirements.txt
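To sanity-check the install, you can try importing the bindings from the same virtual environment (the __version__ attribute is an assumption about recent llama-cpp-python releases):

# Verifies that llama-cpp-python installed and is importable.
import llama_cpp
print(llama_cpp.__version__)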
If you want to also utilize your GPU (instead of just your CPU), follow these platform-specific instructions.
This is the resource I used to finally get things working: https://medium.com/@piyushbatra1999/installing-llama-cpp-python-with-nvidia-gpu-acceleration-on-windows-a-short-guide-0dfac475002d
Summarizing:
Prerequisites
The next steps need to be executed in the same virtual environment you set up above. You should see something like (env) at the start of your terminal prompt (this may not be the case on all platforms or in all terminals).
This will replace the llama-cpp-python you installed via pip install -r requirements.txt and will instruct it to use cuBLAS.
if you're using Command Prompt
set CMAKE_ARGS=-DLLAMA_CUBLAS=on
set FORCE_CMAKE=1
pip install llama-cpp-python --force-reinstall --upgrade --no-cache-dir --verbose
if you're using PowerShell
$env:CMAKE_ARGS = "-DLLAMA_CUBLAS=on"
$env:FORCE_CMAKE=1
pip install llama-cpp-python --force-reinstall --upgrade --no-cache-dir --verbose
When you next run server.py, you'll see BLAS = 1 in a collection of lines that looks like this:
AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 0 | VSX = 0 |
This indicates that server.py can correctly access your GPU resources.
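If you prefer to check from Python directly, the same feature flags can be printed through the low-level bindings; this is a sketch and assumes llama_print_system_info is exported by the llama-cpp-python version you installed:

# Prints the llama.cpp feature flags (BLAS, AVX, ...) without loading a model.
from llama_cpp import llama_print_system_info
print(llama_print_system_info().decode())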
llama_cpp_python is initialized like this:
from llama_cpp import Llama

llm = Llama(
    model_path=model_path,  # path to the downloaded GGUF model file
    n_gpu_layers=-1,        # offload as many layers as possible to the GPU
    n_ctx=8192,             # context window of 8192 tokens
    chat_format="llama-3"   # chat template used to format prompts
)
n_gpu_layers=-1 instructs llama.cpp to offload as many layers as possible, i.e. to use as much of your GPU resources as it can.
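As a quick usage sketch (not code taken from server.py; the prompt below is just an example), a chat completion can then be generated with the high-level API like this:

# Example chat completion; the prompt and max_tokens value are illustrative.
response = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Say hello in one sentence."}],
    max_tokens=64,
)
print(response["choices"][0]["message"]["content"])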
TODO
You'll need a way to switch between terminal tabs (e.g. tmux, VS Code terminal tabs).
Launch the relay, which runs on http://localhost:5000:
python relay.py
Then, in a separate terminal tab, launch the server, which runs on http://localhost:3000:
python server.py
NOTE: When first launched, or if the model file isn't present (currently only Llama 7B Chat GGUF by TheBloke), the script will download the model (approximately 4GB) and save it under the same filename in the models/ directory of your project. This will be gated by user interaction in the future to prevent large downloads without the user's consent; eventually you'll be able to browse models and choose one from a list.
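For a rough idea of what that first-run step does, here's a hedged sketch of a download-if-missing check; the URL, filename, and use of requests are placeholders/assumptions, not the actual code in server.py:

# Sketch of a "download the model only if it's missing" check.
# MODEL_URL and MODEL_FILENAME are placeholders, not the repo's real values.
import os
import requests

MODEL_URL = "https://example.com/path/to/llama-7b-chat.Q4_K_M.gguf"  # placeholder
MODEL_FILENAME = os.path.join("models", "llama-7b-chat.Q4_K_M.gguf")  # placeholder

if not os.path.exists(MODEL_FILENAME):
    os.makedirs("models", exist_ok=True)
    with requests.get(MODEL_URL, stream=True) as resp:
        resp.raise_for_status()
        with open(MODEL_FILENAME, "wb") as f:
            for chunk in resp.iter_content(chunk_size=1 << 20):
                f.write(chunk)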
relay.py acts as a proxy between the client (including, but not limited to, this repo's client.py) and server.py, hiding each one's public IP from the other, which addresses one of the big limitations of P2P networks (e.g. torrents). In a future version, relay.py will not see the contents of the conversation between server and client, thanks to end-to-end encryption. Anyone can fork this project and run their own relay, with compute provided by the various server.py instances running on consumer hardware.
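To make the data flow concrete, here is a minimal, hypothetical relay sketch; the endpoint names and the way servers register are assumptions, not the actual relay.py implementation:

# Hypothetical relay sketch: the client only ever talks to the relay,
# and the relay forwards prompts to whichever server has registered.
from flask import Flask, request, jsonify
import requests

app = Flask(__name__)
registered_servers = []  # e.g. ["http://localhost:3000"] after a server registers

@app.route("/register", methods=["POST"])
def register():
    registered_servers.append(request.json["url"])  # a server announces itself
    return jsonify({"ok": True})

@app.route("/chat", methods=["POST"])
def chat():
    # Forward the client's prompt to a registered server; neither side
    # learns the other's address.
    upstream = requests.post(registered_servers[0] + "/chat", json=request.json)
    return jsonify(upstream.json())

if __name__ == "__main__":
    app.run(port=5000)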
You can test things out using the simple command-line client, client.py:
python client.py
Type your message when prompted and press Enter. All of this is now happening on your local hardware, thanks to llama_cpp_python, a binding for llama.cpp.
To exit, press Ctrl+C.
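If you'd rather script your own client, a minimal sketch along the same lines might look like this (the /chat endpoint and payload shape are assumptions, not the actual protocol used by client.py):

# Hypothetical command-line client: reads a prompt, sends it to the relay,
# prints the reply. The endpoint and JSON keys are assumed, not the real API.
import requests

RELAY_URL = "http://localhost:5000/chat"  # placeholder endpoint

while True:
    try:
        prompt = input("> ")
    except KeyboardInterrupt:
        break
    resp = requests.post(RELAY_URL, json={"message": prompt})
    print(resp.json()["reply"])  # "reply" key is assumed; adjust to the real response shape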
Alternatively, you can visit http://localhost:5000 in your browser.
TODO