Selective Search is a distributed information retrieval (DIR) project for large-scale text search, in which a collection of documents is partitioned into subsets (known as shards) based on the similarity of the documents' content, so that a query can be served by searching only a few shards rather than all of them.
The goal of the project is to build an efficient and effective distributed search system that reduces search cost without degrading search quality.
The methodology of decomposing a large-scale collection into subunits based on the content similarity of its records is termed "Topic-Based Partitioning", and searching only the few shards relevant to a given query is termed "Selective Search", in publications by Kulkarni et al.
Selective Search is implemented in Scala. It builds on Apache Spark MLlib for unsupervised clustering and distributed computing, and on Apache Solr for search and information retrieval.
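The sketch below illustrates the core idea using only Spark MLlib: documents are hashed into term-frequency vectors, weighted with IDF, clustered with k-means, and each cluster id is treated as a topical shard. It is a minimal, self-contained illustration rather than the project's actual pipeline; the toy corpus, feature size, and number of clusters are placeholders.

```scala
// Minimal sketch of topic-based partitioning (not the project's actual pipeline):
// hash documents into term-frequency vectors, weight them with IDF, cluster with
// MLlib k-means, and treat each cluster id as the topical shard for that document.
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.feature.{HashingTF, IDF}

object TopicPartitionSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("topic-partition-sketch").setMaster("local[*]"))

    // Toy corpus; in the project, documents come from the text/ClueWeb09/CSV parsers.
    val docs = sc.parallelize(Seq(
      "solr lucene search indexing",
      "spark mllib kmeans clustering",
      "distributed search shards routing"))

    val tf = new HashingTF(1 << 14).transform(docs.map(_.split("\\s+").toSeq))
    tf.cache()
    val tfidf = new IDF().fit(tf).transform(tf)

    // k, maxIterations and the feature size are placeholders, not tuned values.
    val model = KMeans.train(tfidf, 2, 10)
    val shardIds = model.predict(tfidf) // one topical shard id per document

    docs.zip(shardIds).collect().foreach { case (d, s) => println(s"shard$s <- $d") }
    sc.stop()
  }
}
```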
Java JDK 8 version "1.8.0_131"
Scala 2.12.2
Apache Spark version 2.1.1
Apache Solr 6.6.2
Spark-Solr 3.3.0
Apache Maven 3.5.0
Apache Ivy 2.4.0
Apache ANT version 1.10.1
To compile the current version of selective-search, you will need the software listed above installed and running on your machine.
To verify that the required software is available, run the commands below.
java -version
scala -version
mvn -version
After verifying the required software setup, download the source code and execute the command below.
mvn clean install -DskipTests
To run the selective-search project on localhost (your machine), Apache SolrCloud must be configured. If you do not have it configured already, follow the Solr setup instructions provided later in this README.
After the SolrCloud setup, there are two ways to run selective-search: set it up in an IDE (IntelliJ or Eclipse), or launch a Spark cluster and execute the job on it.
git clone https://github.com/rajanim/selective-search.git
To run from an IDE:
App.scala: right click and run. It should print 'Hello World!', confirming that the Scala code compiles and runs successfully.
TestTextFileParser.scala: modify the root path to match your directory settings, then run the test cases (right click and choose the run option).
org.sfsu.cs.io.newsgroup.NewsgroupRunner.scala: modify the input directory path to a directory on your local machine, then run.
To configure a Spark cluster on localhost:
Download Apache Spark (for example spark-2.0.0-bin-hadoop2.7) and navigate into its directory:
cd /Users/user_name/softwares/spark-2.0.0-bin-hadoop2.7
./sbin/start-master.sh
Create the spark-env.sh file from the provided template:
cp ./conf/spark-env.sh.template ./conf/spark-env.sh
Append a configuration parameter to the end of the file:
echo "export SPARK_WORKER_INSTANCES=4" >> ./conf/spark-env.sh
./sbin/start-slaves.sh <master_url>   (the master URL is shown on the Spark web UI at localhost:8080 once the master has started)
Run the selective-search project on the Spark cluster:
nohup ./bin/spark-submit --master spark://RajaniM-1159:7077 --num-executors 2 --executor-memory 8g --driver-memory 12g --conf "spark.rpc.message.maxSize=2000" --conf "spark.driver.maxResultSize=2000" --class com.sfsu.cs.main.TopicalShardsCreator /Users/rajanishivarajmaski/selective-search/target/selective-search-1.0-SNAPSHOT.jar TopicalShardsCreator -zkHost localhost:9983 -collection word-count -warcFilesPath /Users/rajani.maski/rm/cluweb_catb_part/ -dictionaryLocation /Users/rajani.maski/rm/spark-solr-899/ -numFeatures 25000 -numClusters 50 -numPartitions 50 -numIterations 30 &
Download Apache Solr as a zip (the current implementation runs on Solr 6.2.1).
Go to the terminal and navigate to the path where the unzipped Solr directory is located.
Start Solr in cloud mode:
./bin/solr start -c -m 4g
Once Solr has started with the welcome message 'Happy Searching!', you should be able to connect to its admin UI at http://localhost:8983/solr/#/
and navigate to collections: http://localhost:8983/solr/#/~collections
Add a collection by giving it a collection name and choosing the config set, which should be data_driven_schema_configs; fill in the other key/value inputs as shown in the linked screenshot.
Confirm that the collection was created successfully by navigating to the cloud admin UI at http://localhost:8983/solr/#/~cloud, where you should see the newly created collection.
Another approach to creating a collection is via the command line, for example:
./bin/solr create_collection -n data_driven_schema_configs -c collection_name -p portNumber -shards numShards
For Selective Search, we require a Solr collection created with the implicit routing strategy, for example:
http://localhost:8983/solr/admin/collections?action=CREATE&name=NewsGroupImplicit&numShards=20&replicationFactor=1&maxShardsPerNode=20&router.name=implicit&shards=shard1,shard2,shard3,shard4,shard5,shard6,shard7,shard8,shard9,shard10,shard11,shard12,shard13,shard14,shard15,shard16,shard17,shard18,shard19,shard20&collection.configName=data_driven_schema_configs&router.field=_route_
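Once documents are routed into named shards, a query can be restricted to the shards predicted to be relevant, which is what makes the search "selective". The SolrJ sketch below illustrates this; it is not part of the project code. The collection name and ZooKeeper host mirror the examples in this README, while the field name and shard list are placeholders.

```scala
// SolrJ sketch (not project code): query only the shards predicted to be relevant.
// Collection name and ZooKeeper host mirror the examples above; the field name
// ("body") and the shard list are placeholders.
import org.apache.solr.client.solrj.SolrQuery
import org.apache.solr.client.solrj.impl.CloudSolrClient

object SelectiveQuerySketch {
  def main(args: Array[String]): Unit = {
    val client = new CloudSolrClient.Builder().withZkHost("localhost:9983").build()
    client.setDefaultCollection("NewsGroupImplicit")

    val query = new SolrQuery("body:spark")
    // Restrict the distributed query to selected shards instead of all 20.
    query.setParam("shards", "shard3,shard7")

    val response = client.query(query)
    println(s"Found ${response.getResults.getNumFound} documents")
    client.close()
  }
}
```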
If you have the Banana dashboard installed, it is available at http://localhost:8983/solr/banana/src/index.html#/dashboard
Follow the steps listed below to execute (run) selective search on any other (custom/specific) dataset.
Obtain a SparkContext by instantiating org.sfsu.cs.main.SparkInstance and invoking its createSparkContext method:
val sc = sparkInstance.createSparkContext("class name")
Use org.sfsu.cs.io.text.TextFileParser.scala to obtain plain-text documents mapped to org.sfsu.cs.document.TFDocument.
For other input formats, see the ClueWeb09 parser (org.sfsu.cs.io.clueweb09) and the CSV parser (org.sfsu.cs.io.csv). A hedged end-to-end sketch follows this list.
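The sketch below strings these steps together. Only the createSparkContext call is shown in this README; the no-argument SparkInstance constructor and the use of wholeTextFiles in place of the project's TextFileParser are assumptions that may differ from the actual source.

```scala
// Hypothetical sketch: only sparkInstance.createSparkContext("class name") is shown
// in this README. The no-argument SparkInstance constructor and the use of
// wholeTextFiles in place of the project's TextFileParser are assumptions.
import org.sfsu.cs.main.SparkInstance

object CustomDatasetSketch {
  def main(args: Array[String]): Unit = {
    val sparkInstance = new SparkInstance()          // assumed constructor
    val sc = sparkInstance.createSparkContext("CustomDatasetSketch")

    // Read raw documents; the project's parsers would map these to
    // org.sfsu.cs.document.TFDocument (term-frequency documents) for clustering.
    val rawDocs = sc.wholeTextFiles("/path/to/custom/dataset")
    println(s"Read ${rawDocs.count()} documents")

    sc.stop()
  }
}
```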
Required configurations
Tuning
"spark.driver.maxResultSize=2000"
it cannote exceed 2048./bin/solr start -c -m 4g
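As a sketch, the same tuning values passed via --conf on the spark-submit command line can also be set programmatically on the SparkConf before the SparkContext is created (plain numbers for these two settings are interpreted as megabytes):

```scala
// Sketch: set the spark-submit --conf tuning values programmatically instead.
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setAppName("selective-search")
  .set("spark.rpc.message.maxSize", "2000")   // megabytes; must stay below 2048
  .set("spark.driver.maxResultSize", "2000")  // megabytes; must stay below 2048
```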
If you run into https://issues.apache.org/jira/browse/SOLR-4114, the fix is to make sure maxShardsPerNode is equal to numShards, based on the number of nodes available.
If you have additional questions related to the implementation, experimental results, benchmarks, or usage of Selective-Search, I am happy to help. Please email [email protected]