Python is a useful language. However, it can go wrong due to the scripting language's lack of strict type evaluation and some expensive operations. Based on my personal experience, I have often been troubled by irreproducible errors in Python in programs where the analysis target becomes large and parallelism is required. I casually tried Kotlin and found it to be quite easy to use, so I tried porting the Scraper that was implemented in Python3.
(By the way, I have never really touched Java)
This is a Kotlin newbie, but Python threads and Kotlin threads seem to behave differently.
Although Kotlin's Thread does not split the process, the CPU usage rate exceeds 100%, so it seems that efficient threading is being performed using multiple CPUs. (In other words, after dividing with Multiprocess, there seems to be no need to run Thread under it)
This is how to install on Ubuntu.
$ curl -s https://get.sdkman.io | bash
$ sdk install kotlin
It seems that the JVM memory is recorded in the environment variable JAVA_OPT, and if you use it normally, it will crash due to lack of memory, so it would be better to fix it in a modern way. I have the settings like this.
JAVA_OPTS= " -Xmx3000M -Xms3000M "
I'm not good at Java, and I was thinking of pursuing a career plan that would avoid Java as much as possible, but I feel like it would be difficult to learn the tools themselves, mainly Eclipse and IDE. Because there were a lot of things.
It would be convenient to use Kotlin with an IDE, but I think CUI is fine as long as there are no problems compiling and running with CUI.
There are many ways to compile, but we found it more usable to compile it into a jar file, including the runtime.
$ kotlinc foo.kt -include-runtime -d foo.jar
Now you can compile.
You can combine multiple files into a jar. (You can refer to bar.kt's functions and Class in foo.kt)
$ kotlinc foo.kt bar.kt -include-runtime -d foo.jar
It can be used by adding a jar file that can be compiled using Java's Maven etc. to the classpath. (Suppose you use files alice.jar and bob.jar)
This is very helpful as it allows us to reuse many Java assets.
$ kotlinc foo.kt bar.kt -cp alice.jar:bob.jar -include-runtime foo.jar
For example, when executing a kotlin jar using an external jar file, the command will be like this.
$ kotlin -cp alice.jar:bob.jar:foo.jar FooKt
This name FooKt seems to be used to specify the foo.kt file that contains the main function.
When there is asynchronous data loading using JavaScript, if you simply retrieve it and analyze it with jsoup etc., you will not be able to get the content.You have to run JavaScript to create a state similar to what a human would see. , run phamtomjs via selenium to make JavaScript work For example, Microsoft Bing's image search is rendered with Ajax and cannot work in an environment where JavaScript does not work. (This is for experimental purposes, so when actually scraping images, please do so via the API.)
val driver = PhantomJSDriver ()
driver.manage().window().setSize( Dimension ( 4096 , 2160 ))
driver.get( " https://www.bing.com/images/search?q= ${encoded} " )
//すべての画像が描画されるのを待つ
Thread .sleep( 3001 )
val html = driver.getPageSource()
The html variable will contain the rendered html after JavaScript runs. By putting this in jsoup, you can find the src URL of various images. Based on the URL of the image you found, use the wget command to store it in a folder under any directory.
val doc = Jsoup .parse(html.toString(), " UTF-8 " )
println (doc.title())
doc.select( " img " ).filter { x ->
x.attr( " class " ) == " mimg "
}.map { x ->
val data_bm = x.attr( " data-bm " )
val src = x.attr( " src " )
Runtime .getRuntime().exec( " wget ${src} -O imgs/ ${name} / ${data_bm} .png " )
}
PhantomJS needs to be downloaded from this site and placed in the PATH.
There seem to be several ways to write this, but this is the easiest implementation.
The entire logic to be scraped enclosed in {} becomes a thread instance, and you can start or join that thread to run it in parallel.
val threads = url_details.keys.map { url ->
val th = Thread {
if (url_details[url] !! == "まだ" ) {
_parser (url).map { next ->
urls.add(next)
}
println ( "終わりに更新 : $url " )
url_details[url] = "終わり"
// save urls
_save_conf ( mapper.writeValueAsString(url_details) )
}
}
th
}
It seems that a serialization module called jackson can be used for a limited time.
It seems that the Java library alone does not work, and you need to load the module for Kotlin separately.
By limited, I tried serializing and deserializing MutableMap<String, DataClass> and it didn't work.
MutableMap<String, String> works fine, so I'm not sure if the nested structure is bad or if it doesn't support Data Class.
Serialization example
val mapper = ObjectMapper ().registerKotlinModule()
val serialzied = mapper.writeValueAsString(url_details)
Deserialization example
val mapper = ObjectMapper ().registerKotlinModule()
val url_details = mapper.readValue< MutableMap < String , String >>(json)
First, git clone
$ git clone https://github.com/GINK03/kotlin-phantomjs-selenium-jsoup-parser.git
So far, two types of scraping have been implemented: simply scraping to a depth of 100 using breadth-first search without evaluating JavaScript.
(Since I was using it for scraping my own site, I didn't set any particular limits, but the default is 50 or more parallel accesses, so please adjust accordingly.)
$ sh run.scraper.sh widthSearch ${yourOwnSite}
Use Microsoft Bing to search on the image search screen. This is an experimental code to see if it is possible to get content drawn with Ajax without using the API, so I don't think it should be accessed in large quantities and cause trouble.
Please refer to the kancolle.txt file on github for the search list.
sh run.scraper.sh image ${検索クエリリスト} ${出力ディレクトリ}