dbworld search

Search engine implementation

A very naive search engine built with Django 2.1.3 and Python 3.6.

I am new to Django and not yet fluent in it, so treat this code as a reference only.

Requirements

Programming language: Python 3
Runtime environment: Linux, shell
Tools:
- Django 2.1.3
- Python 3.6
  - summa (text-rank)
  - dj-pagination
  - BeautifulSoup

Results

Home page
Pagination

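Design

Designing the data structures

We need to store an inverted index, plus each subject's send time, sender, subject text, subject link, and so on. I therefore designed the following database structure.

Doc: a document, that is, one web page, holding its main information.
File: linked to Doc by a foreign key; holds the text content of the web page file and a flag marking whether it has been indexed (isIndexed).
Wordindex: one entry of the inverted index, consisting of a term and its posting list. The posting list is a hash table whose keys are Doc.id values and whose values are the number of times the term occurs in that Doc. For simplicity, the hash table (a dict in Python) is stored in the database as a JSON text string.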

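Note that a key-value pair cannot be added with the following code, because the index property decodes a new dict from the stored JSON on every access, so the assignment changes only that temporary copy:

word.index[doc.id] = num
word.save()

Instead, read the dict, modify it, and assign it back so that the setter serializes it again:

dic = word.index
dic[doc.id] = num
word.index = dic
word.save()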
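
The difference between the two snippets can be reproduced without Django. The sketch below uses a hypothetical stand-in class, WordindexLike, that follows the same property pattern as the model but keeps the JSON text in a plain attribute. It is an illustration only, not part of the project.

```python
import json


class WordindexLike:
    """Hypothetical stand-in for the Wordindex model: same property pattern, no Django."""

    def __init__(self, word):
        self.word = word
        self._index = json.dumps({})  # JSON text, as it would sit in the TextField

    @property
    def index(self):
        # Every access decodes a brand-new dict from the JSON string.
        return json.loads(self._index)

    @index.setter
    def index(self, dic):
        self._index = json.dumps(dic)


word = WordindexLike("fuzzy")

# In-place mutation changes only the temporary dict, so nothing is stored.
word.index[7] = 3
lost = word.index

# Read, modify, assign back: the setter serializes the updated dict.
dic = word.index
dic[7] = 3
word.index = dic
kept = word.index
```

Here lost is still an empty dict, while kept holds the entry. JSON object keys are always strings, so the integer key 7 comes back as "7"; this is why ids read from the index need converting with int() before they are used in a query.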

The Django model code is given below.

import json

from django.db import models


class Doc(models.Model):
    sendTime = models.DateField()  # e.g. 2018-12-12; a date only, unlike DateTimeField
    sender = models.CharField(max_length=20)
    messageType = models.CharField(max_length=20)  # journal, conf., etc.
    subject = models.CharField(max_length=100)
    begin = models.DateField()
    deadline = models.DateField()
    subjectUrl = models.CharField(max_length=100)
    webpageUrl = models.CharField(max_length=100)
    desc = models.CharField(max_length=250, default='')
    loc = models.CharField(max_length=40, default='')
    keywords = models.CharField(max_length=200, default='')

    def __str__(self):
        return self.subjectUrl


class Wordindex(models.Model):
    word = models.CharField(max_length=45)

    # The posting dict is stored as JSON text; a custom field is another option.
    _index = models.TextField(null=True)

    @property
    def index(self):
        # _index is nullable, so treat an unset value as an empty index.
        return json.loads(self._index) if self._index else {}

    @index.setter
    def index(self, li):
        self._index = json.dumps(li)

    def __str__(self):
        return self.word


class File(models.Model):
    doc = models.OneToOneField(Doc, on_delete=models.CASCADE)
    content = models.TextField(null=True)
    isIndexed = models.BooleanField(default=False)

    def __str__(self):
        return 'file: {} -> doc: {}'.format(self.id, self.doc.id)

Web scraping

First comes the home page, whose structure looks like this:

<TBODY>
<TR VALIGN=TOP>
<TD>03-Jan-2019</TD>
<TD>conf. ann.</TD>
<TD>marta cimitile</TD>
<TD><A HREF="http://www.cs.wisc.edu/dbworld/messages/2019-01/1546520301.html" rel="nofollow">Call for FUZZ IEEE Special Session</A></TD>
<TD>13-Jan-2019</TD>
<TD><A rel="nofollow" HREF="http://sites.ieee.org/fuzzieee-2019/special-sessions/">web page</A></TD>
</TR></TBODY>

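The rows follow a regular pattern, so the fields can be extracted directly. In my implementation I used Python's BeautifulSoup package for this.

The key point when using it is which parser you pass in: I tried html and lxml, both of which had problems, and ended up using html5lib.

Next, in the fourth column of each table row (the fourth td tag), the <a> tag holds the link to the page for that subject. This link has to be extracted as well.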
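
The extraction described above could look roughly like the sketch below. It is not the author's code. The project uses BeautifulSoup with html5lib, while this sketch swaps in the standard library's html.parser so that it runs without extra packages. That is enough for the single well-formed row used here, though the author reports parser problems on the real page. The class name RowParser and the variable names are hypothetical.

```python
from html.parser import HTMLParser

# One table row, copied from the home page structure shown above.
ROW = """<TR VALIGN=TOP>
<TD>03-Jan-2019</TD>
<TD>conf. ann.</TD>
<TD>marta cimitile</TD>
<TD><A HREF="http://www.cs.wisc.edu/dbworld/messages/2019-01/1546520301.html" rel="nofollow">Call for FUZZ IEEE Special Session</A></TD>
<TD>13-Jan-2019</TD>
<TD><A rel="nofollow" HREF="http://sites.ieee.org/fuzzieee-2019/special-sessions/">web page</A></TD>
</TR>"""


class RowParser(HTMLParser):
    """Collects the text and the first link of every <td> in every <tr>."""

    def __init__(self):
        super().__init__()
        self.rows = []     # finished rows, each a list of (text, href) cells
        self._row = None   # cells of the row being read
        self._cell = None  # [text, href] of the cell being read

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names before calling this.
        if tag == "tr":
            self._row = []
        elif tag == "td" and self._row is not None:
            self._cell = ["", None]
        elif tag == "a" and self._cell is not None and self._cell[1] is None:
            self._cell[1] = dict(attrs).get("href")

    def handle_data(self, data):
        if self._cell is not None:
            self._cell[0] += data

    def handle_endtag(self, tag):
        if tag == "td" and self._cell is not None:
            self._row.append((self._cell[0].strip(), self._cell[1]))
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


parser = RowParser()
parser.feed(ROW)
parser.close()

# Column order as in the row above; the fourth cell carries the subject link.
sent, msg_type, sender, subject, deadline, webpage = parser.rows[0]
subject_text, subject_url = subject
```

Each cell comes back as a (text, href) pair, so the same loop yields both the plain fields (send date, message type, sender, deadline) and the two links.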

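Extracting time and location

Times and locations tend to follow a few general patterns, so the common ones can be enumerated and matched with regular expressions.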
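
A minimal sketch of this idea follows. The author's actual pattern list is not shown, so every pattern here is an assumption: the first date format is the one that appears in the home page table, and the other date formats, the location cue phrases, and the sample sentence are made up for illustration.

```python
import re
from datetime import datetime

# Hypothetical date patterns, each paired with the strptime format that parses it.
DATE_PATTERNS = [
    (re.compile(r"\b\d{1,2}-[A-Z][a-z]{2}-\d{4}\b"), "%d-%b-%Y"),  # 13-Jan-2019
    (re.compile(r"\b[A-Z][a-z]+ \d{1,2}, \d{4}\b"), "%B %d, %Y"),   # January 13, 2019
    (re.compile(r"\b\d{4}-\d{2}-\d{2}\b"), "%Y-%m-%d"),             # 2019-01-13
]

# Hypothetical location pattern: a cue phrase followed by "City, Country",
# where every word of the city and the country starts with a capital letter.
LOCATION_PATTERN = re.compile(
    r"(?:held in|take place in|Location:)\s+"
    r"([A-Z][a-z]+(?: [A-Z][a-z]+)*, [A-Z][A-Za-z]+(?: [A-Z][A-Za-z]+)*)"
)


def find_dates(text):
    """Return all valid dates in text, grouped by pattern in list order."""
    found = []
    for pattern, fmt in DATE_PATTERNS:
        for match in pattern.finditer(text):
            try:
                found.append(datetime.strptime(match.group(), fmt).date())
            except ValueError:
                pass  # matched the shape of a date but is not one, e.g. 31-Feb-2019
    return found


def find_location(text):
    """Return the first "City, Country" after a cue phrase, or None."""
    match = LOCATION_PATTERN.search(text)
    return match.group(1) if match else None


# Made-up sample sentence.
sample = "The session will be held in New Orleans, USA. Submission deadline: 13-Jan-2019."
dates = find_dates(sample)
place = find_location(sample)
```

Passing each match through strptime doubles as validation, so a string that only looks like a date is dropped instead of being stored in a DateField.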

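Extracting summaries and keywords

This step uses the TextRank algorithm. At first I implemented a very basic TextRank myself, but the results were poor, so I switched to the official text-rank version (the summa package listed under the tools).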
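
For readers unfamiliar with TextRank, the sketch below shows only the bare idea for keywords: link words that occur close together, then run PageRank-style score propagation over that graph. It is neither the summa implementation nor the author's first attempt. The function name, the parameters, and the sample text are hypothetical, and it has none of the text preprocessing a real package would apply.

```python
import re
from collections import defaultdict


def textrank_keywords(text, top_n=3, window=2, damping=0.85, iterations=30):
    """Rank words by PageRank scores on a word co-occurrence graph."""
    words = re.findall(r"[a-z]+", text.lower())

    # Undirected graph: two different words are linked when they occur
    # within `window` positions of each other.
    neighbours = defaultdict(set)
    for i, word in enumerate(words):
        for other in words[i + 1 : i + 1 + window]:
            if other != word:
                neighbours[word].add(other)
                neighbours[other].add(word)

    # Each word repeatedly collects an equal share of each neighbour's score.
    score = {word: 1.0 for word in neighbours}
    for _ in range(iterations):
        score = {
            word: (1 - damping)
            + damping * sum(score[n] / len(neighbours[n]) for n in neighbours[word])
            for word in neighbours
        }
    return sorted(score, key=score.get, reverse=True)[:top_n]


# Made-up sample text in which "fuzzy" is adjacent to every other word.
top = textrank_keywords("fuzzy logic fuzzy sets fuzzy systems fuzzy control", window=1)
```

In this sample "fuzzy" is linked to all four other words and therefore ranks first. A version this plain treats every word alike, stop words included, which is one plausible reason a basic implementation performs poorly on real pages.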

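Building the index

This part follows the standard inverted-index procedure: split the page text into tokens, strip punctuation and similar noise, then store the inverted index using the database models described above.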
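
The steps above can be sketched with plain dicts. This is an assumption-laden illustration, not the project's indexing code: the tokenizer rule, the function names, and the sample texts are hypothetical, and a dict stands in for the Wordindex table.

```python
import json
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and keep runs of letters and digits, dropping punctuation."""
    return re.findall(r"[a-z0-9]+", text.lower())


def add_to_index(index, doc_id, text):
    """Add one document to index, a dict of term -> {doc id: occurrences}."""
    for term, count in Counter(tokenize(text)).items():
        index.setdefault(term, {})[doc_id] = count


index = {}
add_to_index(index, 1, "Call for papers: fuzzy systems, fuzzy logic.")
add_to_index(index, 2, "Call for participation in a database workshop.")

# Each posting dict is what one Wordindex row would hold as JSON text.
stored = json.dumps(index["call"])
```

After these two calls the entry for "fuzzy" is {1: 2} and the entry for "call" lists both documents. In the project, each File row would presumably have isIndexed set once its text has gone through this step, so that a later update skips it.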

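Designing the web page

Below the title is a row of options, and the results can be sorted by any of these fields. The next row has an update button and a search form.

Below that are the search results, laid out with div elements.

Each result shows a title, keywords, time, location, and a summary.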

Search and ranking

Here I implemented the tf-idf algorithm myself to rank the results. The code is as follows:

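from math import log10

# Assumes this function lives in the same app as the models shown earlier.
from .models import Doc, Wordindex

# process() and stopWords are defined elsewhere in the project and are not
# shown here; process() is used as a mapping of query term -> count.


def tfidf(words):
    if not words:
        return []  # nothing to search for
    ct = process(words)
    weight = {}
    tf = {}
    for term in ct:
        try:
            tf[term] = Wordindex.objects.get(word=term).index
        except Wordindex.DoesNotExist:
            tf[term] = {}  # the term is not in the index
            continue
        for docid in tf[term]:
            weight.setdefault(docid, 0)
    # N must be the total number of documents. Counting only the matching
    # ones would make the idf factor 0 for every single-term query.
    N = Doc.objects.count()
    for term in ct:
        dic = tf[term]
        for docid, freq in dic.items():
            w = (1 + log10(freq)) * log10(N / len(dic)) * ct[term]
            if term in stopWords:
                w *= 0.3  # down-weight stop words instead of dropping them
            weight[docid] += w
    ids = sorted(weight, key=lambda k: weight[k], reverse=True)
    if len(ids) < 8:
        pass  # TODO: undecided what to do with fewer than 8 results
    # Keys come back from JSON as strings, hence int(i).
    return [Doc.objects.get(id=int(i)).__dict__ for i in ids]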
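
To see the weighting at work without a database, here is a small self-contained sketch of the same formula, (1 + log10(tf)) * log10(N / df) * query count. The index contents, the document total, and the function name rank are made up for illustration, and the stop-word discount is left out to keep it short.

```python
from math import log10

# Hypothetical in-memory stand-ins for the Wordindex table and the Doc count.
# Document ids are strings, as they are after a JSON round trip.
INDEX = {
    "fuzzy": {"1": 2, "3": 1},
    "database": {"2": 1},
    "call": {"1": 1, "2": 1, "3": 1},
}
TOTAL_DOCS = 4


def rank(query_counts):
    """Return document ids ordered by summed tf-idf weight, best first."""
    weight = {}
    for term, qcount in query_counts.items():
        postings = INDEX.get(term, {})
        for docid, freq in postings.items():
            idf = log10(TOTAL_DOCS / len(postings))
            weight[docid] = weight.get(docid, 0) + (1 + log10(freq)) * idf * qcount
    return sorted(weight, key=weight.get, reverse=True)


order = rank({"fuzzy": 1, "call": 1})
```

For the query "fuzzy call", document 1 ranks first because it contains "fuzzy" twice, document 3 follows with one occurrence, and document 2 comes last because it matches only "call", which occurs in three of the four documents and so carries little weight.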