SQLGlot 是一个无依赖的 SQL 解析器、转译器、优化器和引擎。它可用于格式化 SQL 或在 24 种不同的方言之间进行转换,例如 DuckDB、Presto / Trino、Spark / Databricks、Snowflake 和 BigQuery。它的目标是读取各种 SQL 输入,并以目标方言输出语法和语义正确的 SQL。
它是一个非常全面的通用 SQL 解析器,具有强大的测试套件。它的性能也相当高,而且完全是用 Python 编写的。
您可以轻松自定义解析器、分析查询、遍历表达式树以及以编程方式构建 SQL。
语法错误会突出显示,并且根据配置可能会警告或引发方言不兼容性。然而,SQLGlot 的目标并不是成为 SQL 验证器,因此它可能无法检测某些语法错误。
在 API 文档和表达式树入门中了解有关 SQLGlot 的更多信息。
SQLGlot 非常欢迎贡献;阅读贡献指南和入门文档即可开始!
来自 PyPI:
pip3 install " sqlglot[rs] "
# Without Rust tokenizer (slower):
# pip3 install sqlglot
或者通过本地结帐:
make install
开发要求(可选):
make install-dev
给定版本号MAJOR
。 MINOR
。 PATCH
, SQLGlot 使用以下版本控制策略:
PATCH
版本会增加。MINOR
版本会增加。MAJOR
版本会增加。 我们很乐意听取您的意见。加入我们的社区 Slack 频道!
我尝试解析应该有效的 SQL,但失败了,为什么会发生这种情况?
parse_one(sql, dialect="spark")
(或者: read="spark"
)。如果未指定方言, parse_one
将尝试根据“SQLGlot 方言”解析查询,该方言被设计为所有受支持方言的超集。如果您尝试指定方言但仍然不起作用,请提出问题。我尝试输出 SQL,但它的方言不正确!
parse_one(sql, dialect="spark").sql(dialect="duckdb")
(或者: transpile(sql, read="spark", write="duckdb")
)。我尝试解析无效的 SQL,它成功了,尽管它应该引发错误!为什么它没有验证我的 SQL?
sqlglot.dataframe 发生了什么?
轻松从一种方言翻译成另一种方言。例如,日期/时间函数因方言而异,并且可能难以处理:
import sqlglot
sqlglot . transpile ( "SELECT EPOCH_MS(1618088028295)" , read = "duckdb" , write = "hive" )[ 0 ]
' SELECT FROM_UNIXTIME(1618088028295 / POW(10, 3)) '
SQLGlot 甚至可以转换自定义时间格式:
import sqlglot
sqlglot . transpile ( "SELECT STRFTIME(x, '%y-%-m-%S')" , read = "duckdb" , write = "hive" )[ 0 ]
" SELECT DATE_FORMAT(x, 'yy-M-ss') "
标识符分隔符和数据类型也可以翻译:
import sqlglot
# Spark SQL requires backticks (`) for delimited identifiers and uses `FLOAT` over `REAL`
sql = """WITH baz AS (SELECT a, c FROM foo WHERE a = 1) SELECT f.a, b.b, baz.c, CAST("b"."a" AS REAL) d FROM foo f JOIN bar b ON f.a = b.a LEFT JOIN baz ON f.a = baz.a"""
# Translates the query into Spark SQL, formats it, and delimits all of its identifiers
print ( sqlglot . transpile ( sql , write = "spark" , identify = True , pretty = True )[ 0 ])
WITH ` baz ` AS (
SELECT
` a ` ,
` c `
FROM ` foo `
WHERE
` a ` = 1
)
SELECT
` f ` . ` a ` ,
` b ` . ` b ` ,
` baz ` . ` c ` ,
CAST( ` b ` . ` a ` AS FLOAT) AS ` d `
FROM ` foo ` AS ` f `
JOIN ` bar ` AS ` b `
ON ` f ` . ` a ` = ` b ` . ` a `
LEFT JOIN ` baz `
ON ` f ` . ` a ` = ` baz ` . ` a `
评论也会尽力保留:
sql = """
/* multi
line
comment
*/
SELECT
tbl.cola /* comment 1 */ + tbl.colb /* comment 2 */,
CAST(x AS SIGNED), # comment 3
y -- comment 4
FROM
bar /* comment 5 */,
tbl # comment 6
"""
# Note: MySQL-specific comments (`#`) are converted into standard syntax
print ( sqlglot . transpile ( sql , read = 'mysql' , pretty = True )[ 0 ])
/* multi
line
comment
*/
SELECT
tbl . cola /* comment 1 */ + tbl . colb /* comment 2 */ ,
CAST(x AS INT ), /* comment 3 */
y /* comment 4 */
FROM bar /* comment 5 */ , tbl /* comment 6 */
您可以使用表达式助手探索 SQL,以执行诸如在查询中查找列和表之类的操作:
from sqlglot import parse_one , exp
# print all column references (a and b)
for column in parse_one ( "SELECT a, b + 1 AS c FROM d" ). find_all ( exp . Column ):
print ( column . alias_or_name )
# find all projections in select statements (a and c)
for select in parse_one ( "SELECT a, b + 1 AS c FROM d" ). find_all ( exp . Select ):
for projection in select . expressions :
print ( projection . alias_or_name )
# find all tables (x, y, z)
for table in parse_one ( "SELECT * FROM x JOIN y JOIN z" ). find_all ( exp . Table ):
print ( table . name )
阅读 ast 入门知识以了解有关 SQLGlot 内部结构的更多信息。
当解析器检测到语法错误时,它会引发ParseError
:
import sqlglot
sqlglot . transpile ( "SELECT foo FROM (SELECT baz FROM t" )
sqlglot.errors.ParseError: Expecting ). Line 1, Col: 34.
SELECT foo FROM (SELECT baz FROM t
~
结构化语法错误可用于编程使用:
import sqlglot
try :
sqlglot . transpile ( "SELECT foo FROM (SELECT baz FROM t" )
except sqlglot . errors . ParseError as e :
print ( e . errors )
[{
'description' : 'Expecting )' ,
'line' : 1 ,
'col' : 34 ,
'start_context' : 'SELECT foo FROM (SELECT baz FROM ' ,
'highlight' : 't' ,
'end_context' : '' ,
'into_expression' : None
}]
可能无法在某些方言之间翻译某些查询。对于这些情况,SQLGlot 可能会发出警告,并默认继续进行尽力翻译:
import sqlglot
sqlglot . transpile ( "SELECT APPROX_DISTINCT(a, 0.1) FROM foo" , read = "presto" , write = "hive" )
APPROX_COUNT_DISTINCT does not support accuracy
' SELECT APPROX_COUNT_DISTINCT(a) FROM foo '
可以通过设置unsupported_level
属性来更改此行为。例如,我们可以将其设置为RAISE
或IMMEDIATE
以确保引发异常:
import sqlglot
sqlglot . transpile ( "SELECT APPROX_DISTINCT(a, 0.1) FROM foo" , read = "presto" , write = "hive" , unsupported_level = sqlglot . ErrorLevel . RAISE )
sqlglot.errors.UnsupportedError: APPROX_COUNT_DISTINCT does not support accuracy
有些查询需要准确转换附加信息,例如其中引用的表的架构。这是因为某些转换是类型敏感的,这意味着需要类型推断才能理解它们的语义。尽管qualify
和annotate_types
优化器规则可以帮助解决此问题,但默认情况下不会使用它们,因为它们会增加大量开销和复杂性。
转译通常是一个难题,因此 SQLGlot 采用“增量”方法来解决它。这意味着目前可能有一些方言对缺乏对某些输入的支持,但预计这种情况会随着时间的推移而得到改善。我们非常感谢有据可查且经过测试的问题或 PR,因此如果您需要指导,请随时与我们联系!
SQLGlot 支持增量构建 SQL 表达式:
from sqlglot import select , condition
where = condition ( "x=1" ). and_ ( "y=1" )
select ( "*" ). from_ ( "y" ). where ( where ). sql ()
' SELECT * FROM y WHERE x = 1 AND y = 1 '
可以修改解析树:
from sqlglot import parse_one
parse_one ( "SELECT x FROM y" ). from_ ( "z" ). sql ()
' SELECT x FROM z '
解析表达式还可以通过将映射函数应用于树中的每个节点来递归转换:
from sqlglot import exp , parse_one
expression_tree = parse_one ( "SELECT a FROM x" )
def transformer ( node ):
if isinstance ( node , exp . Column ) and node . name == "a" :
return parse_one ( "FUN(a)" )
return node
transformed_tree = expression_tree . transform ( transformer )
transformed_tree . sql ()
' SELECT FUN(a) FROM x '
SQLGlot 可以将查询重写为“优化”形式。它执行多种技术来创建新的规范 AST。该 AST 可用于标准化查询或为实现实际引擎提供基础。例如:
import sqlglot
from sqlglot . optimizer import optimize
print (
optimize (
sqlglot . parse_one ( """
SELECT A OR (B OR (C AND D))
FROM x
WHERE Z = date '2021-01-01' + INTERVAL '1' month OR 1 = 0
""" ),
schema = { "x" : { "A" : "INT" , "B" : "INT" , "C" : "INT" , "D" : "INT" , "Z" : "STRING" }}
). sql ( pretty = True )
)
SELECT
(
" x " . " a " <> 0 OR " x " . " b " <> 0 OR " x " . " c " <> 0
)
AND (
" x " . " a " <> 0 OR " x " . " b " <> 0 OR " x " . " d " <> 0
) AS " _col_0 "
FROM " x " AS " x "
WHERE
CAST( " x " . " z " AS DATE ) = CAST( ' 2021-02-01 ' AS DATE )
您可以通过调用repr
来查看已解析 SQL 的 AST 版本:
from sqlglot import parse_one
print ( repr ( parse_one ( "SELECT a + 1 AS z" )))
Select (
expressions = [
Alias (
this = Add (
this = Column (
this = Identifier ( this = a , quoted = False )),
expression = Literal ( this = 1 , is_string = False )),
alias = Identifier ( this = z , quoted = False ))])
SQLGlot 可以计算两个表达式之间的语义差异,并以将源表达式转换为目标表达式所需的一系列操作的形式输出更改:
from sqlglot import diff , parse_one
diff ( parse_one ( "SELECT a + b, c, d" ), parse_one ( "SELECT c, a - b, d" ))
[
Remove ( expression = Add (
this = Column (
this = Identifier ( this = a , quoted = False )),
expression = Column (
this = Identifier ( this = b , quoted = False )))),
Insert ( expression = Sub (
this = Column (
this = Identifier ( this = a , quoted = False )),
expression = Column (
this = Identifier ( this = b , quoted = False )))),
Keep (
source = Column ( this = Identifier ( this = a , quoted = False )),
target = Column ( this = Identifier ( this = a , quoted = False ))),
...
]
另请参阅:SQL 的语义差异。
可以通过子类化Dialect
来添加方言:
from sqlglot import exp
from sqlglot . dialects . dialect import Dialect
from sqlglot . generator import Generator
from sqlglot . tokens import Tokenizer , TokenType
class Custom ( Dialect ):
class Tokenizer ( Tokenizer ):
QUOTES = [ "'" , '"' ]
IDENTIFIERS = [ "`" ]
KEYWORDS = {
** Tokenizer . KEYWORDS ,
"INT64" : TokenType . BIGINT ,
"FLOAT64" : TokenType . DOUBLE ,
}
class Generator ( Generator ):
TRANSFORMS = { exp . Array : lambda self , e : f"[ { self . expressions ( e ) } ]" }
TYPE_MAPPING = {
exp . DataType . Type . TINYINT : "INT64" ,
exp . DataType . Type . SMALLINT : "INT64" ,
exp . DataType . Type . INT : "INT64" ,
exp . DataType . Type . BIGINT : "INT64" ,
exp . DataType . Type . DECIMAL : "NUMERIC" ,
exp . DataType . Type . FLOAT : "FLOAT64" ,
exp . DataType . Type . DOUBLE : "FLOAT64" ,
exp . DataType . Type . BOOLEAN : "BOOL" ,
exp . DataType . Type . TEXT : "STRING" ,
}
print ( Dialect [ "custom" ])
<class '__main__.Custom'>
SQLGlot 能够解释 SQL 查询,其中表表示为 Python 字典。该引擎不应该很快,但它对于单元测试和跨 Python 对象本机运行 SQL 很有用。此外,该基础可以轻松地与快速计算内核集成,例如 Arrow 和 Pandas。
下面的示例展示了涉及聚合和联接的查询的执行:
from sqlglot . executor import execute
tables = {
"sushi" : [
{ "id" : 1 , "price" : 1.0 },
{ "id" : 2 , "price" : 2.0 },
{ "id" : 3 , "price" : 3.0 },
],
"order_items" : [
{ "sushi_id" : 1 , "order_id" : 1 },
{ "sushi_id" : 1 , "order_id" : 1 },
{ "sushi_id" : 2 , "order_id" : 1 },
{ "sushi_id" : 3 , "order_id" : 2 },
],
"orders" : [
{ "id" : 1 , "user_id" : 1 },
{ "id" : 2 , "user_id" : 2 },
],
}
execute (
"""
SELECT
o.user_id,
SUM(s.price) AS price
FROM orders o
JOIN order_items i
ON o.id = i.order_id
JOIN sushi s
ON i.sushi_id = s.id
GROUP BY o.user_id
""" ,
tables = tables
)
user_id price
1 4.0
2 3.0
另请参阅:从头开始编写 Python SQL 引擎。
SQLGlot 使用 pdoc 来提供其 API 文档。
SQLGlot 网站上有托管版本,或者您可以使用以下命令在本地构建:
make docs-serve
make style # Only linter checks
make unit # Only unit tests (or unit-rs, to use the Rust tokenizer)
make test # Unit and integration tests (or test-rs, to use the Rust tokenizer)
make check # Full test suite & linter checks
基准测试在 Python 3.10.12 上运行只需几秒钟。
询问 | sqlglot | sqlglotrs | sqlfluff | sql树 | sql解析 | moz_sql_解析器 | 氧化硅 |
---|---|---|---|---|---|---|---|
TPCH | 0.00944 (1.0) | 0.00590 (0.625) | 0.32116 (33.98) | 0.00693 (0.734) | 0.02858 (3.025) | 0.03337 (3.532) | 0.00073 (0.077) |
短的 | 0.00065 (1.0) | 0.00044 (0.687) | 0.03511 (53.82) | 0.00049 (0.759) | 0.00163 (2.506) | 0.00234 (3.601) | 0.00005 (0.073) |
长的 | 0.00889 (1.0) | 0.00572 (0.643) | 0.36982 (41.56) | 0.00614 (0.690) | 0.02530 (2.844) | 0.02931 (3.294) | 0.00059 (0.066) |
疯狂的 | 0.02918 (1.0) | 0.01991 (0.682) | 1.88695 (64.66) | 0.02003 (0.686) | 7.46894 (255.9) | 0.64994 (22.27) | 0.00327 (0.112) |
SQLGlot 使用 dateutil 来简化文字 timedelta 表达式。如果找不到模块,优化器将不会简化如下表达式:
x + interval ' 1 ' month