dbworld search

Search engine implementation

A very simple search engine implemented with Django 2.1.3 and Python 3.6.

I am new to Django and not very experienced with it, so treat this code as a reference only.

Requirements

Programming language: Python 3
Environment: Linux, shell
Tools used:
- Django 2.1.3
- Python 3.6
  - summa (TextRank)
  - dj-pagination
  - Beautiful Soup

Result display

Front page
Pagination

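Design

Data structure design

Besides the inverted index itself, the send time, sender, subject, topic link, and similar fields of each topic have to be stored, so I designed the following database structure.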

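Doc: a document, that is, a web page, together with its key information.
File: linked to Doc through a foreign key; holds the text content of the web page file and a flag recording whether it has been indexed (isIndexed).
Wordindex: one entry of the inverted index, belonging to a single term. The postings are designed as a hash table whose keys are Doc.id values and whose values are the number of occurrences of the term in that document. For simplicity, this hash table (a Python dict) is saved in the database as a text string.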
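A minimal sketch of what storing the dict as text means in practice, using the standard json module as the models do. The document ids and counts are made up for illustration. One consequence worth noticing is that JSON object keys are always strings, so integer ids come back as strings and have to be converted again.

```python
import json

# Hypothetical postings for one term: Doc.id -> number of occurrences.
postings = {12: 3, 40: 1}

stored = json.dumps(postings)    # the text that ends up in the database column
restored = json.loads(stored)    # what is read back later

# JSON object keys are strings, so the integer ids need converting back.
doc_ids = [int(key) for key in restored]
```

This is why code that looks up a Doc from a stored key needs an int() conversion.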

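Note that a key-value pair cannot be added with the following code, because the index property builds a new dict from the stored text on every access, so the assignment changes a temporary object and nothing is saved:

word.index[doc.id] = num
word.save()

Instead, read the dict, modify it, and assign it back so that the setter serializes it again:

dic = word.index
dic[doc.id] = num
word.index = dic
word.save()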
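The difference can be reproduced without Django. The sketch below uses a hypothetical stand-in class, Word, with the same property and setter pattern as the Wordindex model; it is only meant to show why the in-place assignment is lost.

```python
import json


class Word:
    """Hypothetical stand-in for Wordindex: a dict kept as JSON text."""

    def __init__(self):
        self._index = json.dumps({})

    @property
    def index(self):
        # A new dict is built from the stored text on every access.
        return json.loads(self._index)

    @index.setter
    def index(self, value):
        self._index = json.dumps(value)


word = Word()

word.index["7"] = 3      # changes a temporary dict; _index is untouched
lost = word.index        # still empty

dic = word.index
dic["7"] = 3
word.index = dic         # the setter serializes the updated dict
kept = word.index
```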

Below is the Django model code.

import json

from django.db import models


class Doc(models.Model):
    sendTime = models.DateField()  # e.g. 2018-12-12; a DateField stores only the date, unlike DateTimeField
    sender = models.CharField(max_length=20)
    messageType = models.CharField(max_length=20)  # journal, conf, etc.
    subject = models.CharField(max_length=100)
    begin = models.DateField()
    deadline = models.DateField()
    subjectUrl = models.CharField(max_length=100)
    webpageUrl = models.CharField(max_length=100)
    desc = models.CharField(max_length=250, default='')
    loc = models.CharField(max_length=40, default='')
    keywords = models.CharField(max_length=200, default='')

    def __str__(self):
        return self.subjectUrl


class Wordindex(models.Model):
    word = models.CharField(max_length=45)

    # The postings dict is stored as JSON text; a custom field would be another option.
    _index = models.TextField(null=True)

    @property
    def index(self):
        # _index may be NULL for a new row, so fall back to an empty dict.
        return json.loads(self._index) if self._index else {}

    @index.setter
    def index(self, li):
        self._index = json.dumps(li)

    def __str__(self):
        return self.word


class File(models.Model):
    doc = models.OneToOneField(Doc, on_delete=models.CASCADE)
    content = models.TextField(null=True)
    isIndexed = models.BooleanField(default=False)

    def __str__(self):
        return 'file: {} -> doc: {}'.format(self.id, self.doc.id)

Web page extraction

First, the structure of the home page is as follows:

<TBODY>
<TR VALIGN=TOP>
<TD>03-Jan-2019</TD>
<TD>conf. ann.</TD>
<TD>marta cimitile</TD>
<TD><A HREF="http://www.cs.wisc.edu/dbworld/messages/2019-01/1546520301.html" rel="nofollow">Call for FUZZ IEEE Special Session</A></TD>
<TD>13-Jan-2019</TD>
<TD><A rel="nofollow" HREF="http://sites.ieee.org/fuzzieee-2019/special-sessions/">web page</A></TD>
</TR></TBODY>

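The markup is regular enough to be extracted directly, but in the implementation I used Python's BeautifulSoup package.

The important point when using it is to pass a parser explicitly. I tried html and lxml and ran into problems with them; in the end I used html5lib.

Next, the fourth column of the table above (the fourth td tag) contains an <a> tag that links to the web page of the topic. This link also needs to be extracted.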
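To illustrate the row layout, here is a dependency-free sketch that uses the standard library's html.parser.HTMLParser instead of BeautifulSoup with html5lib, so it is not the extraction code of this project. The class RowParser, the function parse_rows, and the mapping of the six columns onto the Doc field names are assumptions made for this example.

```python
from html.parser import HTMLParser


class RowParser(HTMLParser):
    """Collect the text and the first link of every <td> in every <tr>."""

    def __init__(self):
        super().__init__()
        self.rows = []      # one list of cells per <tr>
        self._cell = None   # the cell currently being read, if any

    def handle_starttag(self, tag, attrs):
        # HTMLParser lower-cases tag and attribute names.
        if tag == "tr":
            self.rows.append([])
        elif tag == "td" and self.rows:
            self._cell = {"text": "", "href": None}
            self.rows[-1].append(self._cell)
        elif tag == "a" and self._cell is not None and self._cell["href"] is None:
            self._cell["href"] = dict(attrs).get("href")

    def handle_endtag(self, tag):
        if tag == "td":
            self._cell = None

    def handle_data(self, data):
        if self._cell is not None:
            self._cell["text"] += data


def parse_rows(html):
    """Return one dict per table row with six cells (assumed column order)."""
    parser = RowParser()
    parser.feed(html)
    parser.close()
    records = []
    for cells in parser.rows:
        if len(cells) < 6:
            continue  # not a message row
        records.append({
            "sendTime": cells[0]["text"].strip(),
            "messageType": cells[1]["text"].strip(),
            "sender": cells[2]["text"].strip(),
            "subject": cells[3]["text"].strip(),
            "subjectUrl": cells[3]["href"],   # link in the fourth column
            "deadline": cells[4]["text"].strip(),
            "webpageUrl": cells[5]["href"],
        })
    return records


SAMPLE = """<TBODY>
<TR VALIGN=TOP>
<TD>03-Jan-2019</TD>
<TD>conf. ann.</TD>
<TD>marta cimitile</TD>
<TD><A HREF="http://www.cs.wisc.edu/dbworld/messages/2019-01/1546520301.html">Call for FUZZ IEEE Special Session</A></TD>
<TD>13-Jan-2019</TD>
<TD><A HREF="http://sites.ieee.org/fuzzieee-2019/special-sessions/">web page</A></TD>
</TR></TBODY>"""

records = parse_rows(SAMPLE)
```

The date strings are left as text here; converting them to dates would be a separate step.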

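Extracting time and location

Times and locations follow a few common patterns, so they can be extracted by listing those patterns as regular expressions and matching against them.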
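A sketch of this idea is shown below. The patterns are illustrative guesses, not the ones used in this project: three date formats, and a location of the form "held in City, Country". The names DATE_PATTERNS, LOCATION, find_dates, and find_location are hypothetical.

```python
import re
from datetime import datetime

# Each entry pairs a regular expression with the strptime format that parses it.
DATE_PATTERNS = [
    (re.compile(r"\b\d{1,2}-[A-Za-z]{3}-\d{4}\b"), "%d-%b-%Y"),    # 13-Jan-2019
    (re.compile(r"\b[A-Z][a-z]+ \d{1,2}, \d{4}\b"), "%B %d, %Y"),  # June 23, 2019
    (re.compile(r"\b\d{4}-\d{2}-\d{2}\b"), "%Y-%m-%d"),            # 2019-06-23
]

# "held in New Orleans, USA" -> "New Orleans, USA"
LOCATION = re.compile(
    r"\b(?:held|take place) in "
    r"([A-Z][A-Za-z]+(?: [A-Z][A-Za-z]+)*, [A-Z][A-Za-z]+(?: [A-Z][A-Za-z]+)*)"
)


def find_dates(text):
    """Return every date matched by one of the patterns, grouped by pattern."""
    found = []
    for regex, fmt in DATE_PATTERNS:
        for match in regex.finditer(text):
            try:
                found.append(datetime.strptime(match.group(0), fmt).date())
            except ValueError:
                pass  # looked like a date but was not one, e.g. "Call 12, 2019"
    return found


def find_location(text):
    """Return the first 'City, Country' style location, or an empty string."""
    match = LOCATION.search(text)
    return match.group(1) if match else ""


text = ("Submission deadline: 13-Jan-2019. The workshop will be "
        "held in New Orleans, USA on June 23, 2019.")
dates = find_dates(text)
location = find_location(text)
```

Real announcements use many more formats, so the pattern list would need to grow as unmatched pages turn up.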

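Extracting summaries and keywords

Summaries and keywords are extracted with the TextRank algorithm. At first I implemented a very basic TextRank myself, but the results were very poor, so I switched to the official text-rank version.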
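For readers unfamiliar with TextRank, the sketch below shows the basic keyword variant: words that occur close together are linked in a graph, and PageRank-style scores are iterated over that graph. It is a deliberately simple illustration, not the implementation in the library this project uses, and the length-based word filter is a crude assumption standing in for real stop-word removal.

```python
import re
from collections import defaultdict


def textrank_keywords(text, window=2, damping=0.85, iterations=50, top_k=5):
    """Rank words by an unweighted TextRank score over a co-occurrence graph."""
    words = re.findall(r"[a-z]+", text.lower())
    words = [w for w in words if len(w) > 3]  # crude filter (assumption)

    # Link every word to the words that follow it within the window.
    neighbours = defaultdict(set)
    for i, word in enumerate(words):
        for other in words[i + 1 : i + 1 + window]:
            if other != word:
                neighbours[word].add(other)
                neighbours[other].add(word)

    # Iterate the PageRank update on the undirected graph.
    score = {word: 1.0 for word in neighbours}
    for _ in range(iterations):
        score = {
            word: (1 - damping) + damping * sum(
                score[n] / len(neighbours[n]) for n in neighbours[word]
            )
            for word in neighbours
        }
    return sorted(score, key=score.get, reverse=True)[:top_k]


sample = ("database systems workshop database indexing papers "
          "database query research database storage")
top = textrank_keywords(sample)
```

In the sample, "database" co-occurs with every other word, so it receives the highest score.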

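Indexing

This part follows the inverted index principle: the text of each web page is split into words, punctuation and similar noise is removed, and the resulting inverted index is saved with the database models introduced above.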
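The steps above can be sketched in memory as follows. Plain dicts stand in for the Wordindex rows, and the tokenizer (lower-casing and keeping runs of letters and digits) is an assumption; the project's own word splitting may differ.

```python
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lower-case the text and keep runs of letters and digits only."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(docs):
    """Map each term to {doc_id: occurrences} for a {doc_id: text} input."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for term, count in Counter(tokenize(text)).items():
            index[term][doc_id] = count
    return dict(index)


# Hypothetical documents keyed by Doc.id.
index = build_index({
    1: "Call for papers: database systems!",
    2: "Database, database workshop.",
})
```

Each value of the resulting dict has the same shape as the hash table stored per Wordindex entry.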

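Web page design

Below the title there is a row of options; the results can be sorted by these fields. The next row holds a refresh button and the search form.

Below that, the search results are laid out in divs.

Each result shows the title, keywords, time, location, and summary.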
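Sorting by a user-selected field could look like the sketch below. The set of sortable fields is an assumption based on the Doc model, and sort_results is a hypothetical helper; checking the field against a fixed set avoids sorting on arbitrary user input.

```python
from datetime import date

# Assumed set of fields offered in the options row.
SORTABLE_FIELDS = {"sendTime", "deadline", "subject"}


def sort_results(results, field, descending=False):
    """Sort result dicts by a permitted field; unknown fields keep the order."""
    if field not in SORTABLE_FIELDS:
        return list(results)
    return sorted(results, key=lambda r: r[field], reverse=descending)


# Hypothetical results.
results = [
    {"subject": "B workshop", "deadline": date(2019, 3, 1)},
    {"subject": "A conference", "deadline": date(2019, 1, 13)},
]
by_deadline = sort_results(results, "deadline")
```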

Search ranking

To rank the results I implemented the tf-idf algorithm myself. The code is as follows:

from math import log10

# process, stopWords and docs are defined elsewhere in the project and are not shown here.

def tfidf(words):
    if not words:
        return docs
    ct = process(words)  # query terms; ct[term] is used as the term's weight in the query
    weight = {}  # doc id -> accumulated score
    tf = {}      # term -> {doc id: term frequency}
    for term in ct:
        try:
            tf[term] = Wordindex.objects.get(word=term).index
        except Exception as e:
            print(e)
            tf[term] = {}
            continue
        for docid in tf[term]:
            if docid not in weight:
                weight[docid] = 0
    N = len(weight)  # number of documents containing at least one query term
    for term in ct:
        dic = tf[term]
        for docid, freq in dic.items():
            w = (1 + log10(freq)) * log10(N / len(dic)) * ct[term]
            if term in stopWords:
                w *= 0.3  # down-weight stop words
            weight[docid] += w
    ids = sorted(weight, key=lambda k: weight[k], reverse=True)
    if len(ids) < 8:
        pass  # ???
    # The stored keys are JSON strings, hence int(i).
    return [Doc.objects.get(id=int(i)).__dict__ for i in ids]
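The scoring formula can be tried out without a database. The sketch below applies the same weight, (1 + log10 tf) * log10(N / df) * query count, to in-memory postings. The function rank and the sample data are hypothetical, and N is passed in explicitly as total_docs, which is an assumption that differs from the code above, where N is the number of documents matching the query.

```python
from math import log10


def rank(postings, query_counts, total_docs):
    """Return (doc ids by descending score, scores) for a bag-of-words query."""
    scores = {}
    for term, q_count in query_counts.items():
        docs_with_term = postings.get(term, {})
        if not docs_with_term:
            continue  # term not in the index
        idf = log10(total_docs / len(docs_with_term))
        for doc_id, freq in docs_with_term.items():
            weight = (1 + log10(freq)) * idf * q_count
            scores[doc_id] = scores.get(doc_id, 0.0) + weight
    return sorted(scores, key=scores.get, reverse=True), scores


# Hypothetical postings: term -> {doc id: occurrences}.
postings = {"database": {"1": 10, "2": 1}, "fuzzy": {"2": 1}}
order, scores = rank(postings, {"database": 1, "fuzzy": 1}, total_docs=4)
```

Document 2 ranks first here: it matches both terms, and the rarer term "fuzzy" carries the larger idf, which outweighs the ten occurrences of "database" in document 1.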