SQLGlot 是一個無依賴的 SQL 解析器、轉譯器、最佳化器和引擎。它可用於格式化 SQL 或在 24 種不同的方言之間進行轉換,例如 DuckDB、Presto / Trino、Spark / Databricks、Snowflake 和 BigQuery。它的目標是讀取各種 SQL 輸入,並以目標方言輸出語法和語義正確的 SQL。
它是一個非常全面的通用 SQL 解析器,具有強大的測試套件。它的效能也相當高,而且完全是用 Python 寫的。
您可以輕鬆自訂解析器、分析查詢、遍歷表達式樹以及以程式設計方式建立 SQL。
語法錯誤會被突出顯示,並且根據配置可能會警告或引發方言不相容性。然而,SQLGlot 的目標並不是成為 SQL 驗證器,因此它可能無法偵測某些語法錯誤。
在 API 文件和表達式樹入門中了解有關 SQLGlot 的更多資訊。
SQLGlot 非常歡迎貢獻;閱讀貢獻指南和入門文件即可開始!
來自 PyPI:
pip3 install " sqlglot[rs] "
# Without Rust tokenizer (slower):
# pip3 install sqlglot
或透過本地結帳:
make install
開發要求(可選):
make install-dev
給定版本號MAJOR
。 MINOR
。 PATCH
, SQLGlot 使用下列版本控制策略:
PATCH
版本會增加。MINOR
版本會增加。MAJOR
版本會增加。 我們很樂意聽取您的意見。加入我們的社群 Slack 頻道!
我嘗試解析應該有效的 SQL,但失敗了,為什麼會發生這種情況?
parse_one(sql, dialect="spark")
(或: read="spark"
)。如果未指定方言, parse_one
將嘗試根據「SQLGlot 方言」解析查詢,該方言被設計為所有支援方言的超集。如果您嘗試指定方言但仍然不起作用,請提出問題。我嘗試輸出 SQL,但它的方言不正確!
parse_one(sql, dialect="spark").sql(dialect="duckdb")
(或: transpile(sql, read="spark", write="duckdb")
)。我嘗試解析無效的 SQL,它成功了,儘管它應該引發錯誤!為什麼它沒有驗證我的 SQL?
sqlglot.dataframe 發生了什麼事?
輕鬆從一種方言翻譯成另一種方言。例如,日期/時間函數因方言而異,並且可能難以處理:
import sqlglot
sqlglot . transpile ( "SELECT EPOCH_MS(1618088028295)" , read = "duckdb" , write = "hive" )[ 0 ]
' SELECT FROM_UNIXTIME(1618088028295 / POW(10, 3)) '
SQLGlot 甚至可以轉換自訂時間格式:
import sqlglot
sqlglot . transpile ( "SELECT STRFTIME(x, '%y-%-m-%S')" , read = "duckdb" , write = "hive" )[ 0 ]
" SELECT DATE_FORMAT(x, 'yy-M-ss') "
標識符分隔符號和資料類型也可以翻譯:
import sqlglot
# Spark SQL requires backticks (`) for delimited identifiers and uses `FLOAT` over `REAL`
sql = """WITH baz AS (SELECT a, c FROM foo WHERE a = 1) SELECT f.a, b.b, baz.c, CAST("b"."a" AS REAL) d FROM foo f JOIN bar b ON f.a = b.a LEFT JOIN baz ON f.a = baz.a"""
# Translates the query into Spark SQL, formats it, and delimits all of its identifiers
print ( sqlglot . transpile ( sql , write = "spark" , identify = True , pretty = True )[ 0 ])
WITH ` baz ` AS (
SELECT
` a ` ,
` c `
FROM ` foo `
WHERE
` a ` = 1
)
SELECT
` f ` . ` a ` ,
` b ` . ` b ` ,
` baz ` . ` c ` ,
CAST( ` b ` . ` a ` AS FLOAT) AS ` d `
FROM ` foo ` AS ` f `
JOIN ` bar ` AS ` b `
ON ` f ` . ` a ` = ` b ` . ` a `
LEFT JOIN ` baz `
ON ` f ` . ` a ` = ` baz ` . ` a `
評論也會盡力保留:
sql = """
/* multi
line
comment
*/
SELECT
tbl.cola /* comment 1 */ + tbl.colb /* comment 2 */,
CAST(x AS SIGNED), # comment 3
y -- comment 4
FROM
bar /* comment 5 */,
tbl # comment 6
"""
# Note: MySQL-specific comments (`#`) are converted into standard syntax
print ( sqlglot . transpile ( sql , read = 'mysql' , pretty = True )[ 0 ])
/* multi
line
comment
*/
SELECT
tbl . cola /* comment 1 */ + tbl . colb /* comment 2 */ ,
CAST(x AS INT ), /* comment 3 */
y /* comment 4 */
FROM bar /* comment 5 */ , tbl /* comment 6 */
您可以使用表達式助理探索 SQL,以執行諸如在查詢中查找列和表之類的操作:
from sqlglot import parse_one , exp
# print all column references (a and b)
for column in parse_one ( "SELECT a, b + 1 AS c FROM d" ). find_all ( exp . Column ):
print ( column . alias_or_name )
# find all projections in select statements (a and c)
for select in parse_one ( "SELECT a, b + 1 AS c FROM d" ). find_all ( exp . Select ):
for projection in select . expressions :
print ( projection . alias_or_name )
# find all tables (x, y, z)
for table in parse_one ( "SELECT * FROM x JOIN y JOIN z" ). find_all ( exp . Table ):
print ( table . name )
閱讀 ast 入門知識以了解有關 SQLGlot 內部結構的更多資訊。
當解析器偵測到語法錯誤時,它會引發ParseError
:
import sqlglot
sqlglot . transpile ( "SELECT foo FROM (SELECT baz FROM t" )
sqlglot.errors.ParseError: Expecting ). Line 1, Col: 34.
SELECT foo FROM (SELECT baz FROM t
~
結構化語法錯誤可用於程式設計使用:
import sqlglot
try :
sqlglot . transpile ( "SELECT foo FROM (SELECT baz FROM t" )
except sqlglot . errors . ParseError as e :
print ( e . errors )
[{
'description' : 'Expecting )' ,
'line' : 1 ,
'col' : 34 ,
'start_context' : 'SELECT foo FROM (SELECT baz FROM ' ,
'highlight' : 't' ,
'end_context' : '' ,
'into_expression' : None
}]
可能無法在某些方言之間翻譯某些查詢。對於這些情況,SQLGlot 可能會發出警告,並預設繼續進行盡力翻譯:
import sqlglot
sqlglot . transpile ( "SELECT APPROX_DISTINCT(a, 0.1) FROM foo" , read = "presto" , write = "hive" )
APPROX_COUNT_DISTINCT does not support accuracy
' SELECT APPROX_COUNT_DISTINCT(a) FROM foo '
可以透過設定unsupported_level
屬性來變更此行為。例如,我們可以將其設定為RAISE
或IMMEDIATE
以確保引發例外:
import sqlglot
sqlglot . transpile ( "SELECT APPROX_DISTINCT(a, 0.1) FROM foo" , read = "presto" , write = "hive" , unsupported_level = sqlglot . ErrorLevel . RAISE )
sqlglot.errors.UnsupportedError: APPROX_COUNT_DISTINCT does not support accuracy
有些查詢需要準確轉換附加信息,例如其中引用的表的架構。這是因為某些轉換是類型敏感的,這意味著需要類型推斷才能理解它們的語義。儘管qualify
和annotate_types
最佳化器規則可以幫助解決此問題,但預設不會使用它們,因為它們會增加大量開銷和複雜性。
轉譯通常是一個難題,因此 SQLGlot 採用「增量」方法來解決它。這意味著目前可能有一些方言對缺乏對某些輸入的支持,但預計這種情況會隨著時間的推移而改善。我們非常感謝有據可查且經過測試的問題或 PR,因此如果您需要指導,請隨時與我們聯繫!
SQLGlot 支援增量建構 SQL 表達式:
from sqlglot import select , condition
where = condition ( "x=1" ). and_ ( "y=1" )
select ( "*" ). from_ ( "y" ). where ( where ). sql ()
' SELECT * FROM y WHERE x = 1 AND y = 1 '
可以修改解析樹:
from sqlglot import parse_one
parse_one ( "SELECT x FROM y" ). from_ ( "z" ). sql ()
' SELECT x FROM z '
解析表達式也可以透過將映射函數應用於樹中的每個節點來遞歸轉換:
from sqlglot import exp , parse_one
expression_tree = parse_one ( "SELECT a FROM x" )
def transformer ( node ):
if isinstance ( node , exp . Column ) and node . name == "a" :
return parse_one ( "FUN(a)" )
return node
transformed_tree = expression_tree . transform ( transformer )
transformed_tree . sql ()
' SELECT FUN(a) FROM x '
SQLGlot 可以將查詢重寫為「最佳化」形式。它執行多種技術來創建新的規範 AST。此 AST 可用於標準化查詢或為實現實際引擎提供基礎。例如:
import sqlglot
from sqlglot . optimizer import optimize
print (
optimize (
sqlglot . parse_one ( """
SELECT A OR (B OR (C AND D))
FROM x
WHERE Z = date '2021-01-01' + INTERVAL '1' month OR 1 = 0
""" ),
schema = { "x" : { "A" : "INT" , "B" : "INT" , "C" : "INT" , "D" : "INT" , "Z" : "STRING" }}
). sql ( pretty = True )
)
SELECT
(
" x " . " a " <> 0 OR " x " . " b " <> 0 OR " x " . " c " <> 0
)
AND (
" x " . " a " <> 0 OR " x " . " b " <> 0 OR " x " . " d " <> 0
) AS " _col_0 "
FROM " x " AS " x "
WHERE
CAST( " x " . " z " AS DATE ) = CAST( ' 2021-02-01 ' AS DATE )
您可以透過呼叫repr
來查看已解析 SQL 的 AST 版本:
from sqlglot import parse_one
print ( repr ( parse_one ( "SELECT a + 1 AS z" )))
Select (
expressions = [
Alias (
this = Add (
this = Column (
this = Identifier ( this = a , quoted = False )),
expression = Literal ( this = 1 , is_string = False )),
alias = Identifier ( this = z , quoted = False ))])
SQLGlot 可以計算兩個表達式之間的語義差異,並以將來源表達式轉換為目標表達式所需的一系列操作的形式輸出變更:
from sqlglot import diff , parse_one
diff ( parse_one ( "SELECT a + b, c, d" ), parse_one ( "SELECT c, a - b, d" ))
[
Remove ( expression = Add (
this = Column (
this = Identifier ( this = a , quoted = False )),
expression = Column (
this = Identifier ( this = b , quoted = False )))),
Insert ( expression = Sub (
this = Column (
this = Identifier ( this = a , quoted = False )),
expression = Column (
this = Identifier ( this = b , quoted = False )))),
Keep (
source = Column ( this = Identifier ( this = a , quoted = False )),
target = Column ( this = Identifier ( this = a , quoted = False ))),
...
]
另請參閱:SQL 的語義差異。
可以透過子類化Dialect
來添加方言:
from sqlglot import exp
from sqlglot . dialects . dialect import Dialect
from sqlglot . generator import Generator
from sqlglot . tokens import Tokenizer , TokenType
class Custom ( Dialect ):
class Tokenizer ( Tokenizer ):
QUOTES = [ "'" , '"' ]
IDENTIFIERS = [ "`" ]
KEYWORDS = {
** Tokenizer . KEYWORDS ,
"INT64" : TokenType . BIGINT ,
"FLOAT64" : TokenType . DOUBLE ,
}
class Generator ( Generator ):
TRANSFORMS = { exp . Array : lambda self , e : f"[ { self . expressions ( e ) } ]" }
TYPE_MAPPING = {
exp . DataType . Type . TINYINT : "INT64" ,
exp . DataType . Type . SMALLINT : "INT64" ,
exp . DataType . Type . INT : "INT64" ,
exp . DataType . Type . BIGINT : "INT64" ,
exp . DataType . Type . DECIMAL : "NUMERIC" ,
exp . DataType . Type . FLOAT : "FLOAT64" ,
exp . DataType . Type . DOUBLE : "FLOAT64" ,
exp . DataType . Type . BOOLEAN : "BOOL" ,
exp . DataType . Type . TEXT : "STRING" ,
}
print ( Dialect [ "custom" ])
<class '__main__.Custom'>
SQLGlot 能夠解釋 SQL 查詢,其中表格表示為 Python 字典。該引擎不應該很快,但它對於單元測試和跨 Python 物件本機運行 SQL 很有用。此外,該基礎可以輕鬆地與快速計算內核集成,例如 Arrow 和 Pandas。
下面的範例展示了涉及聚合和聯結的查詢的執行:
from sqlglot . executor import execute
tables = {
"sushi" : [
{ "id" : 1 , "price" : 1.0 },
{ "id" : 2 , "price" : 2.0 },
{ "id" : 3 , "price" : 3.0 },
],
"order_items" : [
{ "sushi_id" : 1 , "order_id" : 1 },
{ "sushi_id" : 1 , "order_id" : 1 },
{ "sushi_id" : 2 , "order_id" : 1 },
{ "sushi_id" : 3 , "order_id" : 2 },
],
"orders" : [
{ "id" : 1 , "user_id" : 1 },
{ "id" : 2 , "user_id" : 2 },
],
}
execute (
"""
SELECT
o.user_id,
SUM(s.price) AS price
FROM orders o
JOIN order_items i
ON o.id = i.order_id
JOIN sushi s
ON i.sushi_id = s.id
GROUP BY o.user_id
""" ,
tables = tables
)
user_id price
1 4.0
2 3.0
另請參閱:從頭開始編寫 Python SQL 引擎。
SQLGlot 使用 pdoc 來提供其 API 文件。
SQLGlot 網站上有託管版本,或者您可以使用以下命令在本機上建置:
make docs-serve
make style # Only linter checks
make unit # Only unit tests (or unit-rs, to use the Rust tokenizer)
make test # Unit and integration tests (or test-rs, to use the Rust tokenizer)
make check # Full test suite & linter checks
基準測試在 Python 3.10.12 上運行只需幾秒鐘。
詢問 | sqlglot | sqlglotrs | sqlfluff | sql樹 | sql解析 | moz_sql_解析器 | 氧化矽 |
---|---|---|---|---|---|---|---|
TPCH | 0.00944 (1.0) | 0.00590 (0.625) | 0.32116 (33.98) | 0.00693 (0.734) | 0.02858 (3.025) | 0.03337 (3.532) | 0.00073 (0.077) |
短的 | 0.00065 (1.0) | 0.00044 (0.687) | 0.03511 (53.82) | 0.00049 (0.759) | 0.00163 (2.506) | 0.00234 (3.601) | 0.00005 (0.073) |
長的 | 0.00889 (1.0) | 0.00572 (0.643) | 0.36982 (41.56) | 0.00614 (0.690) | 0.02530 (2.844) | 0.02931 (3.294) | 0.00059 (0.066) |
瘋狂的 | 0.02918 (1.0) | 0.01991 (0.682) | 1.88695 (64.66) | 0.02003 (0.686) | 7.46894 (255.9) | 0.64994 (22.27) | 0.00327 (0.112) |
SQLGlot 使用 dateutil 來簡化文字 timedelta 表達式。如果找不到模組,優化器將不會簡化如下表達式:
x + interval ' 1 ' month