A very naive search engine implemented with Django 2.1.3 and Python 3.6.
I am new to Django and not a proficient writer, so treat this code as a reference only.
We need to store an inverted index, as well as each topic's send time, sender, subject, topic link, and so on. So I designed the following database structure.
Note that adding a key-value pair to the index cannot be done like this:

```python
word.index[doc.id] = num
word.save()
```

because `index` is a property whose getter returns a fresh dict on every access, so the mutation is never written back to `_index`. Instead:

```python
dic = word.index
dic[doc.id] = num
word.index = dic
word.save()
```
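The pitfall can be reproduced without Django at all. Below is a minimal stand-in class (`Wordish` is hypothetical, written just for this demonstration) with the same getter/setter pattern as `Wordindex`:

```python
import json


class Wordish:
    """Minimal stand-in for the Wordindex model's JSON-backed property."""

    def __init__(self):
        self._index = json.dumps({})

    @property
    def index(self):
        # decodes a *new* dict on every access
        return json.loads(self._index)

    @index.setter
    def index(self, d):
        self._index = json.dumps(d)


w = Wordish()

w.index['7'] = 2       # mutates a temporary dict; nothing is stored
lost = w.index         # still {}

d = w.index            # take a copy,
d['7'] = 2             # mutate it,
w.index = d            # and reassign through the setter
kept = w.index         # now {'7': 2}
```

Note also that JSON object keys are always strings, so an integer `doc.id` used as a key comes back as a string after a round trip.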
The Django model code is given below:
```python
import json

from django.db import models


class Doc(models.Model):
    sendTime = models.DateField()  # e.g. 2018-12-12; unlike DateTimeField, stores a date only
    sender = models.CharField(max_length=20)
    messageType = models.CharField(max_length=20)  # journal, conf, etc.
    subject = models.CharField(max_length=100)
    begin = models.DateField()
    deadline = models.DateField()
    subjectUrl = models.CharField(max_length=100)
    webpageUrl = models.CharField(max_length=100)
    desc = models.CharField(max_length=250, default='')
    loc = models.CharField(max_length=40, default='')
    keywords = models.CharField(max_length=200, default='')

    def __str__(self):
        return self.subjectUrl


class Wordindex(models.Model):
    word = models.CharField(max_length=45)
    # store a dict as JSON text; another way is to create a custom field
    _index = models.TextField(null=True)

    @property
    def index(self):
        return json.loads(self._index)

    @index.setter
    def index(self, li):
        self._index = json.dumps(li)

    def __str__(self):
        return self.word


class File(models.Model):
    doc = models.OneToOneField(Doc, on_delete=models.CASCADE)
    content = models.TextField(null=True)
    isIndexed = models.BooleanField(default=False)

    def __str__(self):
        return 'file: {} -> doc: {}'.format(self.id, self.doc.id)
```
The first step is parsing the homepage. Each row of its table has the following structure:
```html
<TBODY>
<TR VALIGN=TOP>
<TD>03-Jan-2019</TD>
<TD>conf. ann.</TD>
<TD>marta cimitile</TD>
<TD><A HREF="http://www.cs.wisc.edu/dbworld/messages/2019-01/1546520301.html" rel="nofollow">Call for FUZZ IEEE Special Session</A></TD>
<TD>13-Jan-2019</TD>
<TD><A rel="nofollow" HREF="http://sites.ieee.org/fuzzieee-2019/special-sessions/">web page</A></TD>
</TR></TBODY>
```
The rows are regular, so the fields can be extracted directly. I used Python's BeautifulSoup package for the extraction. The key point is the choice of parser: I tried the built-in html parser, ran into problems with lxml, and finally settled on html5lib.
The fourth column (the fourth `td` tag) contains an `<a>` tag linking to the web page where the topic is located; that link also needs to be extracted.
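A sketch of the row extraction under these assumptions: the row HTML above is held in `row_html`, and the built-in `html.parser` backend is used here to keep the snippet dependency-light (the post itself settled on html5lib, which would be passed the same way):

```python
from bs4 import BeautifulSoup

row_html = '''
<tr valign="top">
<td>03-Jan-2019</td>
<td>conf. ann.</td>
<td>marta cimitile</td>
<td><a href="http://www.cs.wisc.edu/dbworld/messages/2019-01/1546520301.html">Call for FUZZ IEEE Special Session</a></td>
<td>13-Jan-2019</td>
<td><a href="http://sites.ieee.org/fuzzieee-2019/special-sessions/">web page</a></td>
</tr>
'''

soup = BeautifulSoup(row_html, 'html.parser')
cells = soup.find_all('td')

send_time = cells[0].get_text(strip=True)   # '03-Jan-2019'
sender = cells[2].get_text(strip=True)
link = cells[3].find('a')                   # the topic link in the fourth td
subject = link.get_text(strip=True)
subject_url = link['href']
```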
Since times and locations follow general patterns, the common formats can be enumerated and matched with regular expressions.
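As an illustration of that idea, here is a hedged sketch; the pattern list below is hypothetical, not the post's actual one, and real pages would need more formats:

```python
import re

# illustrative patterns for dates like '13-Jan-2019' or 'January 3, 2019'
DATE_PATTERNS = [
    r'\b\d{1,2}-[A-Z][a-z]{2}-\d{4}\b',
    r'\b(?:January|February|March|April|May|June|July|August|'
    r'September|October|November|December)\s+\d{1,2},\s*\d{4}\b',
]


def find_dates(text):
    """Return every substring of text matching one of the date patterns."""
    hits = []
    for pat in DATE_PATTERNS:
        hits.extend(re.findall(pat, text))
    return hits
```

Location extraction would follow the same shape with place-name patterns instead of date patterns.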
For keyword extraction I used the TextRank algorithm. I first implemented a very basic TextRank myself, but the results were poor, so I later switched to an off-the-shelf implementation.
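To show the shape of the "very basic" variant, here is a self-contained sketch (not the post's code): build a word co-occurrence graph over a sliding window, then iterate a PageRank-style update until the scores settle:

```python
import re


def textrank_keywords(text, window=2, d=0.85, iters=30, topk=5):
    """Simplified TextRank sketch: rank words by co-occurrence centrality."""
    words = re.findall(r'[a-z]+', text.lower())

    # undirected co-occurrence graph within a +/- window
    neighbors = {w: set() for w in words}
    for i, w in enumerate(words):
        for j in range(max(0, i - window), min(len(words), i + window + 1)):
            if j != i and words[j] != w:
                neighbors[w].add(words[j])

    # PageRank-style iteration over the graph
    score = {w: 1.0 for w in neighbors}
    for _ in range(iters):
        new = {}
        for w in neighbors:
            rank = sum(score[u] / len(neighbors[u])
                       for u in neighbors[w] if neighbors[u])
            new[w] = (1 - d) + d * rank
        score = new

    return sorted(score, key=score.get, reverse=True)[:topk]
```

A real implementation would additionally filter by part of speech and merge adjacent keywords into phrases, which is largely why the naive version underperforms.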
This part follows the principle of the inverted index: segment the web page text into words, strip punctuation and the like, and then store the resulting postings using the database model introduced above.
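The indexing step can be sketched in plain Python. The function below (a sketch, independent of Django) produces `{word: {doc_id: frequency}}`, the same shape the post stores JSON-encoded in `Wordindex._index`:

```python
import re
from collections import defaultdict


def build_inverted_index(docs):
    """docs: {doc_id: text}. Returns {word: {doc_id: term_frequency}}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        # lowercase and split on anything that is not a letter or digit
        for word in re.findall(r'[a-z0-9]+', text.lower()):
            index[word][doc_id] = index[word].get(doc_id, 0) + 1
    return dict(index)
```

In the post's setup, each `(word, postings)` pair would then be saved through the `Wordindex.index` setter, one row per word.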
Below the title is a row of options for sorting the results by those fields. The next row holds an update button and the search form. Beneath that, the search results are laid out with `div`s; each result shows a title, keywords, time, location, and summary.
To rank the results I implemented the tf-idf algorithm myself. The code is as follows:
```python
from math import log10

# `docs`, `process`, and `stopWords` are module-level helpers defined elsewhere:
# the full document list, the query-preprocessing function, and a stop-word set.

def tfidf(words):
    if not words:
        return docs  # empty query: fall back to the full document list
    ct = process(words)  # query term -> weight in the query
    weight = {}
    tf = {}
    for term in ct:
        try:
            tf[term] = Wordindex.objects.get(word=term).index
        except Exception as e:
            print(e)
            tf[term] = {}
            continue
        for docid in tf[term]:
            if docid not in weight:
                weight[docid] = 0
    N = len(weight)  # number of candidate documents
    for term in ct:
        dic = tf[term]
        for docid, freq in dic.items():
            # (1 + log tf) * idf * query-term weight
            w = (1 + log10(freq)) * (log10(N / len(dic))) * ct[term]
            if term in stopWords:
                w *= 0.3  # damp the contribution of stop words
            weight[docid] += w
    ids = sorted(weight, key=lambda k: weight[k], reverse=True)
    if len(ids) < 8: pass  # ???
    return [Doc.objects.get(id=int(i)).__dict__ for i in ids]
```
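The same weighting can be exercised without a database. The function below is a pure-Python restatement of the scoring above (`rank` and its parameters are my names, not the post's), operating directly on a `{term: {doc_id: frequency}}` index:

```python
from math import log10


def rank(query_terms, index, stop_words=frozenset()):
    """Return doc ids sorted best-first by the (1 + log tf) * idf weighting."""
    candidates = {d for t in query_terms for d in index.get(t, {})}
    N = len(candidates)
    weight = {d: 0.0 for d in candidates}
    for term in query_terms:
        postings = index.get(term, {})
        if not postings:
            continue
        idf = log10(N / len(postings)) if N else 0.0
        for doc_id, freq in postings.items():
            w = (1 + log10(freq)) * idf
            if term in stop_words:
                w *= 0.3  # same stop-word damping as above
            weight[doc_id] += w
    return sorted(weight, key=weight.get, reverse=True)
```

Note one property inherited from the original: a term appearing in every candidate document gets `idf = 0` and contributes nothing, so only the rarer query terms separate the results.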