A very naive search engine implemented with Django 2.1.3 and Python 3.6.
I am new to Django and not a proficient writer, so treat this code as a reference only.
We need to store an inverted index, as well as each topic's send time, sender, subject, topic link, and so on. So I designed the following database structure.
Note that adding a key-value pair to the index cannot be done like this:

```python
word.index[doc.id] = num
word.save()
```

because `index` is a property whose getter returns a fresh dict on every access, so the mutation is never written back to `_index`. Instead:

```python
dic = word.index
dic[doc.id] = num
word.index = dic
word.save()
```
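The pitfall can be reproduced without Django at all. Below is a minimal stand-in class (`Wordish` is hypothetical, written just for this demonstration) with the same getter/setter pattern as `Wordindex`:

```python
import json


class Wordish:
    """Minimal stand-in for the Wordindex model's JSON-backed property."""

    def __init__(self):
        self._index = json.dumps({})

    @property
    def index(self):
        # decodes a *new* dict on every access
        return json.loads(self._index)

    @index.setter
    def index(self, d):
        self._index = json.dumps(d)


w = Wordish()

w.index['7'] = 2       # mutates a temporary dict; nothing is stored
lost = w.index         # still {}

d = w.index            # take a copy,
d['7'] = 2             # mutate it,
w.index = d            # and reassign through the setter
kept = w.index         # now {'7': 2}
```

Note also that JSON object keys are always strings, so an integer `doc.id` used as a key comes back as a string after a round trip.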
The Django model code is given below:
```python
import json

from django.db import models


class Doc(models.Model):
    sendTime = models.DateField()  # e.g. 2018-12-12; unlike DateTimeField, stores a date only
    sender = models.CharField(max_length=20)
    messageType = models.CharField(max_length=20)  # journal, conf, etc.
    subject = models.CharField(max_length=100)
    begin = models.DateField()
    deadline = models.DateField()
    subjectUrl = models.CharField(max_length=100)
    webpageUrl = models.CharField(max_length=100)
    desc = models.CharField(max_length=250, default='')
    loc = models.CharField(max_length=40, default='')
    keywords = models.CharField(max_length=200, default='')

    def __str__(self):
        return self.subjectUrl


class Wordindex(models.Model):
    word = models.CharField(max_length=45)
    # store a dict as JSON text; another way is to create a custom field
    _index = models.TextField(null=True)

    @property
    def index(self):
        return json.loads(self._index)

    @index.setter
    def index(self, li):
        self._index = json.dumps(li)

    def __str__(self):
        return self.word


class File(models.Model):
    doc = models.OneToOneField(Doc, on_delete=models.CASCADE)
    content = models.TextField(null=True)
    isIndexed = models.BooleanField(default=False)

    def __str__(self):
        return 'file: {} -> doc: {}'.format(self.id, self.doc.id)
```
The first step is parsing the homepage. Each row of its table has the following structure:
```html
<TBODY>
<TR VALIGN=TOP>
<TD>03-Jan-2019</TD>
<TD>conf. ann.</TD>
<TD>marta cimitile</TD>
<TD><A HREF="http://www.cs.wisc.edu/dbworld/messages/2019-01/1546520301.html" rel="nofollow">Call for FUZZ IEEE Special Session</A></TD>
<TD>13-Jan-2019</TD>
<TD><A rel="nofollow" HREF="http://sites.ieee.org/fuzzieee-2019/special-sessions/">web page</A></TD>
</TR></TBODY>
```
The rows are regular, so the fields can be extracted directly. I used Python's BeautifulSoup package for the extraction. The key point is the choice of parser: I tried the built-in html parser, ran into problems with lxml, and finally settled on html5lib.
The fourth column (the fourth `td` tag) contains an `<a>` tag linking to the web page where the topic is located; that link also needs to be extracted.
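A sketch of the row extraction under these assumptions: the row HTML above is held in `row_html`, and the built-in `html.parser` backend is used here to keep the snippet dependency-light (the post itself settled on html5lib, which would be passed the same way):

```python
from bs4 import BeautifulSoup

row_html = '''
<tr valign="top">
<td>03-Jan-2019</td>
<td>conf. ann.</td>
<td>marta cimitile</td>
<td><a href="http://www.cs.wisc.edu/dbworld/messages/2019-01/1546520301.html">Call for FUZZ IEEE Special Session</a></td>
<td>13-Jan-2019</td>
<td><a href="http://sites.ieee.org/fuzzieee-2019/special-sessions/">web page</a></td>
</tr>
'''

soup = BeautifulSoup(row_html, 'html.parser')
cells = soup.find_all('td')

send_time = cells[0].get_text(strip=True)   # '03-Jan-2019'
sender = cells[2].get_text(strip=True)
link = cells[3].find('a')                   # the topic link in the fourth td
subject = link.get_text(strip=True)
subject_url = link['href']
```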
Since times and locations follow general patterns, the common formats can be enumerated and matched with regular expressions.
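As an illustration of that idea, here is a hedged sketch; the pattern list below is hypothetical, not the post's actual one, and real pages would need more formats:

```python
import re

# illustrative patterns for dates like '13-Jan-2019' or 'January 3, 2019'
DATE_PATTERNS = [
    r'\b\d{1,2}-[A-Z][a-z]{2}-\d{4}\b',
    r'\b(?:January|February|March|April|May|June|July|August|'
    r'September|October|November|December)\s+\d{1,2},\s*\d{4}\b',
]


def find_dates(text):
    """Return every substring of text matching one of the date patterns."""
    hits = []
    for pat in DATE_PATTERNS:
        hits.extend(re.findall(pat, text))
    return hits
```

Location extraction would follow the same shape with place-name patterns instead of date patterns.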
For keyword extraction I used the TextRank algorithm. I first implemented a very basic TextRank myself, but the results were poor, so I later switched to an off-the-shelf implementation.
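To show the shape of the "very basic" variant, here is a self-contained sketch (not the post's code): build a word co-occurrence graph over a sliding window, then iterate a PageRank-style update until the scores settle:

```python
import re


def textrank_keywords(text, window=2, d=0.85, iters=30, topk=5):
    """Simplified TextRank sketch: rank words by co-occurrence centrality."""
    words = re.findall(r'[a-z]+', text.lower())

    # undirected co-occurrence graph within a +/- window
    neighbors = {w: set() for w in words}
    for i, w in enumerate(words):
        for j in range(max(0, i - window), min(len(words), i + window + 1)):
            if j != i and words[j] != w:
                neighbors[w].add(words[j])

    # PageRank-style iteration over the graph
    score = {w: 1.0 for w in neighbors}
    for _ in range(iters):
        new = {}
        for w in neighbors:
            rank = sum(score[u] / len(neighbors[u])
                       for u in neighbors[w] if neighbors[u])
            new[w] = (1 - d) + d * rank
        score = new

    return sorted(score, key=score.get, reverse=True)[:topk]
```

A real implementation would additionally filter by part of speech and merge adjacent keywords into phrases, which is largely why the naive version underperforms.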
This part follows the principle of the inverted index: segment the web page text into words, strip punctuation and the like, and then store the resulting postings using the database model introduced above.
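The indexing step can be sketched in plain Python. The function below (a sketch, independent of Django) produces `{word: {doc_id: frequency}}`, the same shape the post stores JSON-encoded in `Wordindex._index`:

```python
import re
from collections import defaultdict


def build_inverted_index(docs):
    """docs: {doc_id: text}. Returns {word: {doc_id: term_frequency}}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        # lowercase and split on anything that is not a letter or digit
        for word in re.findall(r'[a-z0-9]+', text.lower()):
            index[word][doc_id] = index[word].get(doc_id, 0) + 1
    return dict(index)
```

In the post's setup, each `(word, postings)` pair would then be saved through the `Wordindex.index` setter, one row per word.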
Below the title is a row of options for sorting the results by those fields. The next row holds an update button and the search form. Beneath that, the search results are laid out with `div`s; each result shows a title, keywords, time, location, and summary.
To rank the results I implemented the tf-idf algorithm myself. The code is as follows:
```python
from math import log10

# `docs`, `process`, and `stopWords` are module-level helpers defined elsewhere:
# the full document list, the query-preprocessing function, and a stop-word set.

def tfidf(words):
    if not words:
        return docs  # empty query: fall back to the full document list
    ct = process(words)  # query term -> weight in the query
    weight = {}
    tf = {}
    for term in ct:
        try:
            tf[term] = Wordindex.objects.get(word=term).index
        except Exception as e:
            print(e)
            tf[term] = {}
            continue
        for docid in tf[term]:
            if docid not in weight:
                weight[docid] = 0
    N = len(weight)  # number of candidate documents
    for term in ct:
        dic = tf[term]
        for docid, freq in dic.items():
            # (1 + log tf) * idf * query-term weight
            w = (1 + log10(freq)) * (log10(N / len(dic))) * ct[term]
            if term in stopWords:
                w *= 0.3  # damp the contribution of stop words
            weight[docid] += w
    ids = sorted(weight, key=lambda k: weight[k], reverse=True)
    if len(ids) < 8: pass  # ???
    return [Doc.objects.get(id=int(i)).__dict__ for i in ids]
```
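The same weighting can be exercised without a database. The function below is a pure-Python restatement of the scoring above (`rank` and its parameters are my names, not the post's), operating directly on a `{term: {doc_id: frequency}}` index:

```python
from math import log10


def rank(query_terms, index, stop_words=frozenset()):
    """Return doc ids sorted best-first by the (1 + log tf) * idf weighting."""
    candidates = {d for t in query_terms for d in index.get(t, {})}
    N = len(candidates)
    weight = {d: 0.0 for d in candidates}
    for term in query_terms:
        postings = index.get(term, {})
        if not postings:
            continue
        idf = log10(N / len(postings)) if N else 0.0
        for doc_id, freq in postings.items():
            w = (1 + log10(freq)) * idf
            if term in stop_words:
                w *= 0.3  # same stop-word damping as above
            weight[doc_id] += w
    return sorted(weight, key=weight.get, reverse=True)
```

Note one property inherited from the original: a term appearing in every candidate document gets `idf = 0` and contributes nothing, so only the rarer query terms separate the results.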