LLM4Decompile

Results | Models | Quick Start | HumanEval-Decompile | Citation | Paper | Colab | YouTube

Reverse Engineering: Decompiling Binary Code with Large Language Models

Updates

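[2024-10-17]: Released decompile-ghidra-100k, a subset of 100k training samples (25k per optimization level). We provide a training script that runs in about 3.5 hours on a single A100 40G GPU. It achieves a re-executability rate of 0.26, so a quick replication of LLM4Decompile costs under $20 in total.
[2024-09-26]: Updated the Colab notebook to demonstrate the usage of the LLM4Decompile models, including examples for the LLM4Decompile-End and LLM4Decompile-Ref models.
[2024-09-23]: Released LLM4Decompile-9B-v2, fine-tuned from Yi-Coder-9B, which achieves a re-executability rate of 0.6494 on the decompilation benchmark.
[2024-06-19]: Released the V2 series (LLM4Decompile-Ref). V2 (1.3B-22B) is built on Ghidra and trained on 2 billion tokens to refine the pseudo-code decompiled by Ghidra. The 22B-V2 version outperforms 6.7B-V1.5 by 40.1%. Please check the ghidra folder for details.
[2024-05-13]: Released the V1.5 series (LLM4Decompile-End, which decompiles binaries directly with an LLM). V1.5 is trained on a larger dataset (15B tokens) with a maximum token length of 4,096, and improves performance by over 100% compared to the previous models.
[2024-03-16]: Added the llm4decompile-6.7b-uo model, which is trained without prior knowledge of the optimization level (O0~O3). Its average re-executability is around 0.219, the best among our models as of that release.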

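About

LLM4Decompile is the pioneering open-source large language model dedicated to decompilation. Its current version supports decompiling Linux x86_64 binaries, compiled with GCC at optimization levels O0 to O3, into human-readable C source code. Our team is committed to expanding this tool's capabilities, with ongoing efforts to incorporate a broader range of architectures and configurations.
LLM4Decompile-End focuses on decompiling the binary directly. LLM4Decompile-Ref refines the pseudo-code decompiled by Ghidra.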

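Evaluation

Framework

During compilation, the preprocessor processes the source code (SRC) to eliminate comments and expand macros or includes. The cleaned code is then forwarded to the compiler, which converts it into assembly code (ASM). The assembler transforms this ASM into binary code (0s and 1s). The linker finalizes the process by linking function calls to create an executable file. Decompilation, on the other hand, converts binary code back into source code. LLMs are trained on text and cannot process binary data directly, so binaries must first be disassembled by Objdump into assembly language (ASM). Note that the binary and the disassembled ASM are equivalent and can be converted into each other, so we refer to them interchangeably. Finally, the loss is computed between the decompiled code and the source code to guide training. To assess the quality of the decompiled code (SRC'), its functionality is tested through test assertions (re-executability).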
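The four compilation stages and the disassembly step described above map onto standard GCC and objdump invocations. The following is a minimal sketch that only builds the command lines; the helper name `pipeline_commands` and the file stem `sample` are illustrative assumptions, not part of the LLM4Decompile repository.

```python
from typing import List, Tuple


def pipeline_commands(stem: str = "sample") -> List[Tuple[str, List[str]]]:
    """Return (stage, argv) pairs for SRC -> ASM -> binary, plus disassembly.

    `stem` is a hypothetical file stem; the commands are not executed here.
    """
    return [
        # Preprocessor: strip comments, expand macros and includes.
        ("preprocess", ["gcc", "-E", f"{stem}.c", "-o", f"{stem}.i"]),
        # Compiler: turn the preprocessed source into assembly (ASM).
        ("compile", ["gcc", "-S", f"{stem}.i", "-o", f"{stem}.s"]),
        # Assembler: turn ASM into an object file (binary code).
        ("assemble", ["gcc", "-c", f"{stem}.s", "-o", f"{stem}.o"]),
        # Linker: resolve function calls and produce the executable.
        ("link", ["gcc", f"{stem}.o", "-o", stem, "-lm"]),
        # Disassembly: recover ASM text from the binary for the LLM.
        ("disassemble", ["objdump", "-d", stem]),
    ]


if __name__ == "__main__":
    for stage, argv in pipeline_commands():
        print(f"{stage:12s} {' '.join(argv)}")
```

Each argv list can be passed to `subprocess.run(argv, check=True)` on a Linux machine with GCC and binutils installed.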

Metrics

Re-executability evaluates whether the decompiled code can execute properly and pass all the predefined test cases.
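As a rough illustration of this metric, the sketch below compiles a decompiled function together with its test assertions and treats a zero exit code as a pass. It is not the project's evaluation script: the helper names, the temporary file layout, and the assumption that `c_test` contains a `main` whose failing assertions abort the process are all illustrative.

```python
import os
import subprocess
import tempfile
from typing import Iterable


def is_reexecutable(c_func: str, c_test: str, timeout: float = 10.0) -> bool:
    """Return True if c_func + c_test compiles with gcc and the binary exits with 0."""
    with tempfile.TemporaryDirectory() as tmp:
        src = os.path.join(tmp, "combined.c")
        exe = os.path.join(tmp, "combined")
        with open(src, "w", encoding="utf-8") as f:
            f.write(c_func + "\n" + c_test)
        try:
            build = subprocess.run(
                ["gcc", src, "-o", exe, "-lm"], capture_output=True, timeout=timeout
            )
            if build.returncode != 0:  # decompiled code does not even compile
                return False
            run = subprocess.run([exe], capture_output=True, timeout=timeout)
        except (subprocess.TimeoutExpired, OSError):
            # Timeouts and a missing gcc are both counted as failures.
            return False
        return run.returncode == 0  # a failed assert() yields a non-zero status


def re_executability_rate(outcomes: Iterable[bool]) -> float:
    """Fraction of samples that passed; 0.0 for an empty input."""
    outcomes = list(outcomes)
    return sum(outcomes) / len(outcomes) if outcomes else 0.0
```

Under these assumptions, the reported percentage for a model would correspond to `re_executability_rate` over the per-sample results of `is_reexecutable`.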

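Benchmarks

HumanEval-Decompile: a collection of 164 C functions that rely exclusively on standard C libraries.
ExeBench: a collection of 2,621 functions drawn from real projects, each using user-defined functions, structures, and macros.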

Results

Models

LLM4Decompile includes models with parameter counts between 1.3 billion and 33 billion, and we have made these models available on Hugging Face.

Model	Checkpoint	Size	Re-executability	Note
llm4decompile-1.3b-v1.5	HF Link	1.3B	27.3%	Note 3
llm4decompile-6.7b-v1.5	HF Link	6.7B	45.4%	Note 3
llm4decompile-1.3b-v2	HF Link	1.3B	46.0%	Note 4
llm4decompile-6.7b-v2	HF Link	6.7B	52.7%	Note 4
llm4decompile-9b-v2	HF Link	9B	64.9%	Note 4
llm4decompile-22b-v2	HF Link	22B	63.6%	Note 4

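Note 3: The V1.5 series is trained on a larger dataset (15B tokens) with a maximum token length of 4,096, and improves performance by over 100% compared to the previous models.

Note 4: The V2 series is built on Ghidra and trained on 2 billion tokens to refine the pseudo-code decompiled by Ghidra. Check the ghidra folder for details.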

Quick Start

Setup: Please use the script below to install the necessary environment.

git clone https://github.com/albertan017/LLM4Decompile.git
cd LLM4Decompile
conda create -n 'llm4decompile' python=3.9 -y
conda activate llm4decompile
pip install -r requirements.txt

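Here is an example of how to use our model (revised for V1.5; for previous models, please check the corresponding model page on HF). Note: replace "func0" with the name of the function you want to decompile.

Preprocessing: Compile the C code into a binary, then disassemble the binary into assembly instructions.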

import subprocess

func_name = 'func0'
OPT = ["O0", "O1", "O2", "O3"]
fileName = 'samples/sample'  # 'path/to/file'
for opt_state in OPT:
    output_file = fileName + '_' + opt_state
    input_file = fileName + '.c'
    compile_command = f'gcc -o {output_file}.o {input_file} -{opt_state} -lm'  # compile the code with GCC on Linux
    subprocess.run(compile_command, shell=True, check=True)
    compile_command = f'objdump -d {output_file}.o > {output_file}.s'  # disassemble the binary file into assembly instructions
    subprocess.run(compile_command, shell=True, check=True)

    input_asm = ''
    with open(output_file + '.s') as f:  # asm file
        asm = f.read()
        if '<' + func_name + '>:' not in asm:  # IMPORTANT: replace func0 with the function name
            raise ValueError("compile fails")
        # keep only the target function, up to the blank line that ends it
        asm = '<' + func_name + '>:' + asm.split('<' + func_name + '>:')[-1].split('\n\n')[0]  # IMPORTANT: replace func0 with the function name
        asm_clean = ""
        asm_sp = asm.split("\n")
        for tmp in asm_sp:
            # skip lines that carry only an address and raw bytes
            if len(tmp.split("\t")) < 3 and '00' in tmp:
                continue
            idx = min(
                len(tmp.split("\t")) - 1, 2
            )
            tmp_asm = "\t".join(tmp.split("\t")[idx:])  # remove the binary code
            tmp_asm = tmp_asm.split("#")[0].strip()  # remove the comments
            asm_clean += tmp_asm + "\n"
    input_asm = asm_clean.strip()
    before = "# This is the assembly code:\n"  # prompt
    after = "\n# What is the source code?\n"  # prompt
    input_asm_prompt = before + input_asm.strip() + after
    with open(fileName + '_' + opt_state + '.asm', 'w', encoding='utf-8') as f:
        f.write(input_asm_prompt)

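The assembly instructions should be in the following format:

<FUNCTION_NAME>:\nOPERATIONS\nOPERATIONS\n

Typical assembly instructions may look like this:

<func0>:
endbr64
lea    (%rdi,%rsi,1),%eax
retq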
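To see how raw objdump output turns into this format, the preprocessing logic can be applied to an in-memory string. This is a minimal sketch under stated assumptions: `clean_objdump`, `build_prompt`, and `RAW_OBJDUMP` are hypothetical names, and the sample text is a hand-written approximation of what `objdump -d` prints (tab-separated address, bytes, and instruction).

```python
# Hand-written approximation of `objdump -d` output (illustrative only).
RAW_OBJDUMP = (
    "0000000000001129 <func0>:\n"
    "    1129:\tf3 0f 1e fa          \tendbr64\n"
    "    112d:\t8d 04 37             \tlea    (%rdi,%rsi,1),%eax\n"
    "    1130:\tc3                   \tretq\n"
    "\n"
    "0000000000001131 <main>:\n"
    "    1131:\tf3 0f 1e fa          \tendbr64\n"
)


def clean_objdump(asm: str, func_name: str) -> str:
    """Extract one function and keep only its label and instructions."""
    marker = f"<{func_name}>:"
    if marker not in asm:
        raise ValueError(f"{marker} not found in disassembly")
    # Keep the text from the label up to the blank line that ends the function.
    body = marker + asm.split(marker)[-1].split("\n\n")[0]
    cleaned = []
    for line in body.split("\n"):
        fields = line.split("\t")
        # Skip lines that hold only an address and raw bytes.
        if len(fields) < 3 and "00" in line:
            continue
        idx = min(len(fields) - 1, 2)  # drop the address and byte columns
        text = "\t".join(fields[idx:]).split("#")[0].strip()  # drop comments
        cleaned.append(text)
    return "\n".join(cleaned).strip()


def build_prompt(cleaned_asm: str) -> str:
    """Wrap the cleaned assembly in the prompt used by the preprocessing example."""
    return "# This is the assembly code:\n" + cleaned_asm + "\n# What is the source code?\n"


if __name__ == "__main__":
    print(build_prompt(clean_objdump(RAW_OBJDUMP, "func0")))
```

For the sample above, `clean_objdump` yields the three-instruction `<func0>:` listing shown earlier, and the `<main>` function is discarded.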

Decompilation: Use LLM4Decompile to translate the assembly instructions into C:

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_path = 'LLM4Binary/llm4decompile-6.7b-v1.5'  # V1.5 model
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(model_path, torch_dtype=torch.bfloat16).cuda()

with open(fileName + '_' + OPT[0] + '.asm', 'r') as f:  # optimization level O0
    asm_func = f.read()
inputs = tokenizer(asm_func, return_tensors="pt").to(model.device)
with torch.no_grad():
    outputs = model.generate(**inputs, max_new_tokens=2048)  # max length is 4096; max_new_tokens should stay below that range
c_func_decompile = tokenizer.decode(outputs[0][len(inputs[0]):-1])

with open(fileName + '.c', 'r') as f:  # original file
    func = f.read()

print(f'original function:\n{func}')  # Note: we only decompile one function, while the original file may contain multiple functions
print(f'decompiled function:\n{c_func_decompile}')
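The slice `outputs[0][len(inputs[0]):-1]` in the example drops the prompt tokens and then the final generated token, which presumably is the end-of-sequence token. If generation stops because it reached `max_new_tokens`, the final token is a regular one and would be lost. The sketch below shows a more defensive variant on plain lists of token ids; `new_token_ids` is a hypothetical helper, not part of the repository or of the transformers library.

```python
from typing import List, Optional


def new_token_ids(output_ids: List[int], prompt_len: int, eos_id: Optional[int]) -> List[int]:
    """Return only the newly generated ids, dropping a trailing EOS if present."""
    generated = list(output_ids[prompt_len:])  # skip the echoed prompt
    if generated and eos_id is not None and generated[-1] == eos_id:
        generated.pop()  # remove EOS only when generation actually ended with it
    return generated


if __name__ == "__main__":
    # Prompt ids 1, 2, 3; generated ids 10, 11; EOS id 0 (all made up).
    print(new_token_ids([1, 2, 3, 10, 11, 0], prompt_len=3, eos_id=0))
```

With the variables from the example above, this could be used as `tokenizer.decode(new_token_ids(outputs[0].tolist(), inputs["input_ids"].shape[1], tokenizer.eos_token_id))`.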

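HumanEval-Decompile

The data is stored in llm4decompile/decompile-eval/decompile-eval-executable-gcc-obj.json as a JSON list. There are 164*4 (O0, O1, O2, O3) samples, each with five keys:

task_id: the ID of the problem.
type: the optimization level, one of [O0, O1, O2, O3].
c_func: the C solution for the HumanEval problem.
c_test: the C test assertions.
input_asm_prompt: the assembly instructions with prompts, which can be derived as in our preprocessing example.

Please check the evaluation scripts.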
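As a small illustration of this layout, the sketch below loads such a JSON list and groups the samples by optimization level. It assumes only the five keys listed above; `load_samples` and `group_by_opt_level` are hypothetical helper names, and the demo records are made up rather than taken from the benchmark.

```python
import json
from collections import defaultdict
from typing import Dict, List

EXPECTED_KEYS = {"task_id", "type", "c_func", "c_test", "input_asm_prompt"}


def load_samples(path: str) -> List[dict]:
    """Read the benchmark file, which is a JSON list of sample objects."""
    with open(path, "r", encoding="utf-8") as f:
        return json.load(f)


def group_by_opt_level(samples: List[dict]) -> Dict[str, List[dict]]:
    """Group samples by their `type` field (O0, O1, O2, O3)."""
    groups = defaultdict(list)
    for sample in samples:
        missing = EXPECTED_KEYS - set(sample)
        if missing:
            raise KeyError(f"sample is missing keys: {sorted(missing)}")
        groups[sample["type"]].append(sample)
    return dict(groups)


if __name__ == "__main__":
    # Made-up records that only mimic the documented keys.
    demo = [
        {"task_id": 0, "type": "O0", "c_func": "", "c_test": "", "input_asm_prompt": ""},
        {"task_id": 0, "type": "O1", "c_func": "", "c_test": "", "input_asm_prompt": ""},
    ]
    for level, items in sorted(group_by_opt_level(demo).items()):
        print(level, len(items))
```

On the real file, each of the four groups would be expected to hold 164 samples, following the 164*4 count stated above.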

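Ongoing

Larger training dataset with a cleaning process. (done: 2024-05-13)
Support for popular languages/platforms and settings.
Support for executable binaries. (done: 2024-05-13)
Integration with decompilation tools (e.g., Ghidra, Rizin).

License

This code repository is licensed under the MIT and DeepSeek licenses.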

Citation

@misc{tan2024llm4decompile,
      title={LLM4Decompile: Decompiling Binary Code with Large Language Models}, 
      author={Hanzhuo Tan and Qi Luo and Jing Li and Yuqun Zhang},
      year={2024},
      eprint={2403.05286},
      archivePrefix={arXiv},
      primaryClass={cs.PL}
}
