dbworld search
1.0.0
A very, very naive search engine built with Django 2.1.3 and Python 3.6.
I am new to Django and not yet fluent with it, so treat this code as a reference only.
Home page
Pagination
We need to store an inverted index, plus each topic's send time, sender, subject, subject link, and so on. So I designed the following database structure.
isIndexed
Note that a key-value pair must not be added to the index with the following code:
word.index[doc.id] = num
word.save()
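Instead, read the dict, update it, and assign it back so that the property setter serializes it again: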
dic = word.index
dic[doc.id] = num
word.index = dic
word.save()
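The reason is that `index` is a property that decodes the stored JSON on every read, so an in-place assignment only changes a temporary dict. The sketch below reproduces this with a plain Python class; `WordindexSketch` is a hypothetical stand-in with no database behind it, not the real model.

```python
import json


class WordindexSketch:
    """Hypothetical plain-Python stand-in for a model with a JSON-backed property."""

    def __init__(self):
        self._index = json.dumps({})

    @property
    def index(self):
        # A new dict is decoded on every access.
        return json.loads(self._index)

    @index.setter
    def index(self, value):
        self._index = json.dumps(value)


word = WordindexSketch()

# In-place update: changes a temporary dict, the stored JSON is untouched.
word.index["7"] = 3
lost = word.index

# Read, update, assign back: the setter serializes the updated dict.
dic = word.index
dic["7"] = 3
word.index = dic
kept = word.index
```

Here `lost` is still empty while `kept` contains the new entry. The key is written as a string because JSON object keys always come back as strings after a round trip.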
Below is the Django model code:
import json

from django.db import models


class Doc(models.Model):
    sendTime = models.DateField()  # e.g. 2018-12-12; differs from DateTimeField, which stores a datetime
    sender = models.CharField(max_length=20)
    messageType = models.CharField(max_length=20)  # journal, conf, etc.
    subject = models.CharField(max_length=100)
    begin = models.DateField()
    deadline = models.DateField()
    subjectUrl = models.CharField(max_length=100)
    webpageUrl = models.CharField(max_length=100)
    desc = models.CharField(max_length=250, default='')
    loc = models.CharField(max_length=40, default='')
    keywords = models.CharField(max_length=200, default='')

    def __str__(self):
        return self.subjectUrl


class Wordindex(models.Model):
    word = models.CharField(max_length=45)
    # Store the index as JSON text; another way is to create a custom field.
    _index = models.TextField(null=True)

    @property
    def index(self):
        # _index is nullable, so treat a missing value as an empty index.
        return json.loads(self._index) if self._index else {}

    @index.setter
    def index(self, li):
        self._index = json.dumps(li)

    def __str__(self):
        return self.word


class File(models.Model):
    doc = models.OneToOneField(Doc, on_delete=models.CASCADE)
    content = models.TextField(null=True)
    isIndexed = models.BooleanField(default=False)

    def __str__(self):
        return 'file: {} -> doc: {}'.format(self.id, self.doc.id)
First, the home page. Its structure looks like this:
<TBODY>
<TR VALIGN=TOP>
<TD>03-Jan-2019</TD>
<TD>conf. ann.</TD>
<TD>marta cimitile</TD>
<TD><A HREF="http://www.cs.wisc.edu/dbworld/messages/2019-01/1546520301.html" rel="nofollow">Call forFUZZ IEEE Special Session</A></TD>
<TD>13-Jan-2019</TD>
<TD><A rel="nofollow" HREF="http://sites.ieee.org/fuzzieee-2019/special-sessions/">web page</A></TD>
</TR></TBODY>
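The structure is regular, so the fields can be extracted directly. In my implementation I used Python's BeautifulSoup package for the extraction.
The key point when using it is which parser you pass in: I tried html and lxml, both of which had problems, and ended up using html5lib.
Next, the fourth column of the table row above (that is, the fourth td tag) contains an <a> tag
whose link points to the web page of the topic. It needs to be extracted as well.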
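A minimal sketch of this extraction step, assuming the row layout shown above. The function name `parse_rows`, the `SAMPLE_ROW` constant and the returned dict keys (borrowed from the `Doc` field names) are illustrative, not the project's actual crawler code. The sample is wrapped in a `<TABLE>` because html5lib discards table rows that appear outside a table.

```python
from datetime import datetime

from bs4 import BeautifulSoup

SAMPLE_ROW = """
<TABLE><TBODY>
<TR VALIGN=TOP>
<TD>03-Jan-2019</TD>
<TD>conf. ann.</TD>
<TD>marta cimitile</TD>
<TD><A HREF="http://www.cs.wisc.edu/dbworld/messages/2019-01/1546520301.html" rel="nofollow">Call forFUZZ IEEE Special Session</A></TD>
<TD>13-Jan-2019</TD>
<TD><A rel="nofollow" HREF="http://sites.ieee.org/fuzzieee-2019/special-sessions/">web page</A></TD>
</TR></TBODY></TABLE>
"""


def parse_date(text):
    # Dates on the page look like 03-Jan-2019.
    return datetime.strptime(text, "%d-%b-%Y").date()


def parse_rows(html, parser="html5lib"):
    """Return one dict per table row that has the expected six cells."""
    soup = BeautifulSoup(html, parser)
    rows = []
    for tr in soup.find_all("tr"):
        cells = tr.find_all("td")
        if len(cells) < 6:
            continue  # not a message row
        subject_link = cells[3].find("a")  # fourth td: link to the topic page
        web_link = cells[5].find("a")      # sixth td: link to the external web page
        rows.append({
            "sendTime": parse_date(cells[0].get_text(strip=True)),
            "messageType": cells[1].get_text(strip=True),
            "sender": cells[2].get_text(strip=True),
            "subject": cells[3].get_text(strip=True),
            "subjectUrl": subject_link["href"] if subject_link else "",
            "deadline": parse_date(cells[4].get_text(strip=True)),
            "webpageUrl": web_link["href"] if web_link else "",
        })
    return rows
```

Calling `parse_rows(SAMPLE_ROW)` uses html5lib, which must be installed separately; passing `parser="html.parser"` falls back to the parser bundled with Python, which is enough for a clean snippet like this one.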
Since times and locations follow general patterns, the common patterns can be enumerated and matched with regular expressions.
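As a rough illustration of that idea, the sketch below lists a few date formats and one location format. The specific patterns, the trigger phrases ("held in", "Location:" and so on) and the function names are assumptions made for this example; the project's real pattern list is not shown here.

```python
import re

MONTH = (r"(?:Jan(?:uary)?|Feb(?:ruary)?|Mar(?:ch)?|Apr(?:il)?|May|Jun(?:e)?|"
         r"Jul(?:y)?|Aug(?:ust)?|Sep(?:t(?:ember)?)?|Oct(?:ober)?|"
         r"Nov(?:ember)?|Dec(?:ember)?)")

# Each entry covers one commonly seen way of writing a date or date range.
DATE_PATTERNS = [
    re.compile(rf"\b\d{{1,2}}-{MONTH}-\d{{4}}\b"),                                 # 13-Jan-2019
    re.compile(rf"\b{MONTH}\.?\s+\d{{1,2}}(?:\s*-\s*\d{{1,2}})?,?\s*\d{{4}}\b"),   # June 23-26, 2019
    re.compile(rf"\b\d{{1,2}}(?:\s*-\s*\d{{1,2}})?\s+{MONTH}\.?,?\s+\d{{4}}\b"),   # 23-26 June 2019
    re.compile(r"\b\d{4}-\d{2}-\d{2}\b"),                                          # 2019-06-23
]

# One or more capitalized words, e.g. "New Orleans".
PLACE = r"[A-Z][A-Za-z.\-]*(?:\s[A-Z][A-Za-z.\-]*)*"

# A trigger phrase followed by comma-separated place names, e.g. "City, State, Country".
LOCATION_PATTERN = re.compile(
    rf"(?:held in|take place in|Location:|Venue:)\s*({PLACE}(?:,\s*{PLACE})+)"
)


def find_dates(text):
    """Return every substring that matches one of the date patterns."""
    found = []
    for pattern in DATE_PATTERNS:
        found.extend(m.group(0) for m in pattern.finditer(text))
    return found


def find_location(text):
    """Return the first location-like match, or an empty string."""
    m = LOCATION_PATTERN.search(text)
    return m.group(1) if m else ""
```

Patterns like these only catch what has been enumerated, so the list has to grow as new formats show up in the messages.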
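Keywords are extracted with the TextRank algorithm. At first I implemented a very basic TextRank myself, but the results were poor, so I later switched to the official text-rank version.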
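For readers unfamiliar with the idea, here is a very basic TextRank keyword sketch using only the standard library: words that co-occur within a small window are linked, and a PageRank-style iteration scores them. It is not the library mentioned above and not the author's first attempt; the function name, the window size, the damping factor and the crude length filter are all assumptions of this example.

```python
import re
from collections import defaultdict


def textrank_keywords(text, top_k=5, window=2, damping=0.85, iterations=50):
    """Rank words by a PageRank-style score over an undirected co-occurrence graph."""
    words = [w for w in re.findall(r"[a-z]+", text.lower()) if len(w) > 2]

    # Link each word to the distinct words that follow it within the window.
    neighbors = defaultdict(set)
    for i, w in enumerate(words):
        for other in words[i + 1:i + 1 + window]:
            if other != w:
                neighbors[w].add(other)
                neighbors[other].add(w)

    # Iterate the unweighted TextRank update a fixed number of times.
    score = {w: 1.0 for w in neighbors}
    for _ in range(iterations):
        score = {
            w: (1 - damping) + damping * sum(
                score[n] / len(neighbors[n]) for n in neighbors[w]
            )
            for w in neighbors
        }

    # Highest score first; ties are broken alphabetically for stable output.
    return sorted(score, key=lambda w: (-score[w], w))[:top_k]
```

A real implementation would normally also filter by part of speech and merge adjacent keywords into phrases, which this sketch leaves out.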
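This part follows the principle of an inverted index: tokenize the page text, strip punctuation and so on, then store the inverted index using the database model described above.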
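The sketch below shows the same idea in memory, without the database: each word maps to a dict of document id to occurrence count, which is the shape stored per word in `Wordindex.index`. The helper names `tokenize` and `build_inverted_index` and the simple regex tokenizer are assumptions for illustration.

```python
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and keep only alphanumeric runs, dropping punctuation."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_inverted_index(docs):
    """Map each word to {doc_id: count} for a dict of {doc_id: text}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for word, count in Counter(tokenize(text)).items():
            # Ids are stored as strings, matching what a JSON round trip returns.
            index[word][str(doc_id)] = count
    return dict(index)
```

For example, `build_inverted_index({1: "Fuzzy systems, fuzzy logic!", 2: "Database systems."})` records that "fuzzy" occurs twice in document 1 and that "systems" occurs once in each document.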
At the top is the title. Below it is a row of options; results can be sorted by these fields. The next row has an update button and a search form.
Below that are the search results, laid out with div elements.
Each result contains a title, keywords, time, location, and a summary.
Here I implemented the tf-idf algorithm myself to rank the results. The code is as follows:
from math import log10


def tfidf(words):
    # docs, process() and stopWords are defined elsewhere in the project (not shown here).
    if not words:
        return docs
    ct = process(words)  # query term -> count in the query
    weight = {}
    tf = {}
    # Load the postings ({doc id: frequency}) for every query term.
    for term in ct:
        try:
            tf[term] = Wordindex.objects.get(word=term).index
        except Exception as e:
            print(e)
            tf[term] = {}
            continue
        for docid in tf[term]:
            if docid not in weight:
                weight[docid] = 0
    # N counts only the documents that match at least one query term.
    N = len(weight)
    for term in ct:
        dic = tf[term]
        for docid, freq in dic.items():
            w = (1 + log10(freq)) * log10(N / len(dic)) * ct[term]
            if term in stopWords:
                w *= 0.3  # down-weight stop words
            weight[docid] += w
    ids = sorted(weight, key=lambda k: weight[k], reverse=True)
    # if len(ids) < 8: ???  (left unhandled in the original)
    return [Doc.objects.get(id=int(i)).__dict__ for i in ids]
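To try the scoring formula without a database, here is a self-contained sketch over an in-memory index. It keeps the weight (1 + log10(tf)) * log10(N / df) * query count from the function above, but it takes N as the total number of documents in the collection, passed in as `total_docs`, which is the more common convention and an assumption of this example. The name `rank` and the toy index are hypothetical, and the stop-word down-weighting is left out.

```python
from math import log10


def rank(query_counts, index, total_docs):
    """Return doc ids sorted by descending tf-idf score.

    query_counts: {term: count in the query}
    index:        {term: {doc_id: term frequency}}
    total_docs:   number of documents in the whole collection
    """
    weight = {}
    for term, q in query_counts.items():
        postings = index.get(term, {})
        if not postings:
            continue  # term does not occur in any document
        idf = log10(total_docs / len(postings))
        for doc_id, freq in postings.items():
            weight[doc_id] = weight.get(doc_id, 0.0) + (1 + log10(freq)) * idf * q
    return sorted(weight, key=lambda d: weight[d], reverse=True)


TOY_INDEX = {
    "fuzzy": {"1": 2, "3": 1},
    "database": {"2": 1},
}
```

With four documents in total, `rank({"fuzzy": 1}, TOY_INDEX, 4)` puts document 1 ahead of document 3 because "fuzzy" occurs there more often, and adding "database" to the query moves document 2 to the top because that term is rarer.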