kotlin headlessbrowser selenium jsoup parser Download - kotlin headlessbrowser selenium jsoup parser Source code download

Scraping with Kotlin

Figure 1. Abukuma's mosaic art made from KanColle images scraped with Kotlin.

Partial migration from Python to Kotlin: a machine learning engineer's perspective

Python is a useful language. However, its lack of strict type checking as a scripting language and the cost of some operations can cause trouble. In my experience, I have often been bothered by irreproducible errors in Python programs once the data to analyze grows large and parallelism is required. I tried Kotlin casually and found it quite easy to use, so I ported a scraper that I had implemented in Python 3.
(By the way, I have hardly ever touched Java.)

Scraper using Python threads and multiprocessing

Figure 2. Access method of Scraper that I have always used in Python

Python distinguishes between multiple processes and threads, and threads alone do not spread CPU-bound work across cores. Because of CPython's global interpreter lock, only one CPU is effectively used, which is why the trick in Figure 2 (split into processes first, then run threads under each process) is necessary.

I am new to Kotlin, but Python threads and Kotlin threads seem to behave differently.
Kotlin's Thread does not split the process, yet the CPU usage rate exceeds 100%, so the threads appear to be scheduled across multiple CPUs. (In other words, there seems to be no need to split into multiple processes first and then run threads under each of them.)
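
As a rough way to observe this behavior, the sketch below runs the same CPU-bound loop on one JVM thread per core. The function names busySum and runInParallel and the workload are made up for illustration and are not part of the original scraper. While it runs, a tool such as top should show CPU usage above 100% for the single java process.

```kotlin
import kotlin.concurrent.thread

// CPU-bound busy work: sum of the integers 0 until n (hypothetical workload).
fun busySum(n: Long): Long {
    var acc = 0L
    for (i in 0L until n) acc += i
    return acc
}

// Run `workers` copies of busySum on plain JVM threads inside one process.
fun runInParallel(workers: Int, n: Long): List<Long> {
    val results = LongArray(workers)
    val threads = (0 until workers).map { idx ->
        // thread { ... } creates the thread and starts it immediately
        thread { results[idx] = busySum(n) }
    }
    // join() waits for each worker and makes its result visible here
    threads.forEach { it.join() }
    return results.toList()
}

fun main() {
    val cores = Runtime.getRuntime().availableProcessors()
    val results = runInParallel(cores, 200_000_000L)
    println("cores=$cores, finished workers=${results.size}")
}
```

Each worker writes to its own slot of the array, so no locking is needed in this sketch.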

Installing Kotlin

This is how to install on Ubuntu.

$ curl -s https://get.sdkman.io | bash
$ sdk install kotlin

Increase memory

The JVM memory settings appear to be read from the environment variable JAVA_OPTS. With the default settings the program crashes from lack of memory, so it is better to raise the limit to suit a modern machine. I use the following settings.

$ export JAVA_OPTS="-Xmx3000M -Xms3000M"
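
One way to check whether the heap setting was picked up is to ask the JVM for its limits at startup. This is only a sketch; the helper name toMiB is made up here, and the reported maximum is typically slightly below the -Xmx value.

```kotlin
// Convert a byte count to whole mebibytes (helper name is hypothetical).
fun toMiB(bytes: Long): Long = bytes / (1024L * 1024L)

fun main() {
    val rt = Runtime.getRuntime()
    // maxMemory() is the upper bound the JVM will try to use (related to -Xmx)
    println("max heap   : ${toMiB(rt.maxMemory())} MiB")
    // totalMemory() is the heap currently reserved (starts near -Xms)
    println("total heap : ${toMiB(rt.totalMemory())} MiB")
}
```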

How to compile and run from the command line

I'm not good at Java, and I had been planning a career that avoids Java as much as possible. One reason is that the tools themselves, mainly Eclipse and other IDEs, seemed hard to learn because there is so much to them.
Kotlin is convenient to use with an IDE, but I think the command line is fine as long as compiling and running work there without problems.
There are many ways to compile, but I found it most practical to compile into a jar file that includes the runtime.

$ kotlinc foo.kt -include-runtime -d foo.jar

This compiles foo.kt into foo.jar.
You can also combine multiple files into one jar. (Functions and classes defined in bar.kt can then be referenced from foo.kt.)

$ kotlinc foo.kt bar.kt -include-runtime -d foo.jar

Jar files built with Java tooling such as Maven can be used by adding them to the classpath. (Suppose you use the files alice.jar and bob.jar.)
This is very helpful, because it lets you reuse the many existing Java libraries.

$ kotlinc foo.kt bar.kt -cp alice.jar:bob.jar -include-runtime -d foo.jar

For example, to run a Kotlin jar that depends on external jar files, the command looks like this.

$ kotlin -cp alice.jar:bob.jar:foo.jar FooKt

The name FooKt is the class that Kotlin generates from foo.kt, the file that contains the main function.

Sites that use JavaScript: combining PhantomJS, Selenium, and jsoup is easy

When a page loads its data asynchronously with JavaScript, simply fetching it and parsing it with jsoup or a similar tool will not give you the content. You have to execute the JavaScript to reach a state close to what a human would see, so PhantomJS is run via Selenium to make the JavaScript work. For example, Microsoft Bing's image search is rendered with Ajax and does not work in an environment without JavaScript. (This is for experimental purposes, so when actually scraping images, please do so via the API.)

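    import org.openqa.selenium.Dimension
    import org.openqa.selenium.phantomjs.PhantomJSDriver

    val driver = PhantomJSDriver()
    // use a large window so that many results are rendered at once
    driver.manage().window().setSize(Dimension(4096, 2160))
    driver.get("https://www.bing.com/images/search?q=${encoded}")
    // wait for all images to be rendered
    Thread.sleep(3001)
    val html = driver.getPageSource()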
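
The encoded value in the URL above is not defined in this excerpt. A minimal sketch of how such a value could be produced, assuming it is simply the URL-encoded search query, is shown here. The function name searchUrl is hypothetical.

```kotlin
import java.net.URLEncoder

// Build the search URL from a raw query string.
// Assumption: `encoded` in the snippet above is the URL-encoded query.
fun searchUrl(query: String): String {
    val encoded = URLEncoder.encode(query, "UTF-8")
    return "https://www.bing.com/images/search?q=$encoded"
}

fun main() {
    // spaces and non-ASCII characters are escaped so the URL stays valid
    println(searchUrl("kancolle abukuma"))
}
```

Encoding matters here because search terms such as Japanese character names would otherwise produce an invalid URL.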

The html variable holds the rendered HTML after the JavaScript has run. By feeding this to jsoup, you can find the src URLs of the images. For each image URL found, the wget command is used to store the image in a folder under a directory of your choice.

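    import org.jsoup.Jsoup

    // the second argument of Jsoup.parse(String, String) is a base URI, not a charset, so it is omitted
    val doc = Jsoup.parse(html)
    println(doc.title())
    // keep only img elements whose class attribute is exactly "mimg"
    doc.select("img").filter { x ->
        x.attr("class") == "mimg"
    }.forEach { x ->
        val data_bm = x.attr("data-bm")
        val src = x.attr("src")
        // name is defined elsewhere in the program (not shown in this excerpt)
        Runtime.getRuntime().exec("wget ${src} -O imgs/${name}/${data_bm}.png")
    }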
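
Passing a single command string to Runtime.exec splits it on whitespace, which can break on unusual URLs or file names. As a possible alternative (not what the original code does), the sketch below passes the arguments to wget as a list via ProcessBuilder and waits for the exit code. The names wgetCommand and download and the example paths are hypothetical.

```kotlin
import java.io.File

// Build the argument list for one download; each element is passed to wget as-is.
fun wgetCommand(src: String, outFile: File): List<String> =
    listOf("wget", src, "-O", outFile.path)

// Run wget for one image and wait for it to finish; returns the exit code.
// Assumes wget is installed and on the PATH.
fun download(src: String, outFile: File): Int {
    outFile.parentFile?.mkdirs()          // create imgs/<name>/ if missing
    return ProcessBuilder(wgetCommand(src, outFile))
        .inheritIO()                      // show wget's output in this console
        .start()
        .waitFor()
}

fun main() {
    // only prints the command; call download(...) to actually fetch a file
    val out = File("imgs/example/1.png")  // hypothetical output path
    println(wgetCommand("https://example.com/a.png?x=1&y=2", out))
}
```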

PhantomJS must be downloaded separately and its executable placed on the PATH.

Thread

There seem to be several ways to write this, but this is the simplest implementation.
The scraping logic enclosed in {} becomes a Thread instance, and you can start and then join those threads to run them in parallel.

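    // url_details maps each URL to its state: "まだ" (not yet) or "終わり" (done)
    val threads = url_details.keys.map { url ->
      Thread {
        if (url_details[url]!! == "まだ") {
          // queue every URL returned by _parser for a later pass
          _parser(url).forEach { next ->
            urls.add(next)
          }
          println("終わりに更新 : $url")   // message: "updated to done"
          url_details[url] = "終わり"
          // save urls
          _save_conf(mapper.writeValueAsString(url_details))
        }
      }
    }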
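
The snippet above only builds the Thread objects; starting and joining them is not shown in this excerpt. A self-contained sketch of the whole cycle follows. It assumes thread-safe collections are wanted, and it uses a stand-in fakeParser and the state labels "todo" and "done", all of which are hypothetical and not taken from the repository.

```kotlin
import java.util.concurrent.ConcurrentHashMap

// Stand-in for the real parser: pretend every page links to two children.
fun fakeParser(url: String): List<String> = listOf("$url/a", "$url/b")

// One pass: scrape every URL still marked "todo", one thread per URL.
fun crawlOnce(seeds: List<String>): Pair<Map<String, String>, Set<String>> {
    val details = ConcurrentHashMap<String, String>()
    seeds.forEach { details[it] = "todo" }
    val found = ConcurrentHashMap.newKeySet<String>()   // thread-safe set

    // build one Thread per URL; nothing runs yet
    val threads = details.keys.toList().map { url ->
        Thread {
            if (details[url] == "todo") {
                found.addAll(fakeParser(url))
                details[url] = "done"
            }
        }
    }
    threads.forEach { it.start() }   // run all of them in parallel
    threads.forEach { it.join() }    // wait until every thread has finished
    return details.toMap() to found.toSet()
}

fun main() {
    val (details, found) = crawlOnce(listOf("http://example.com/1", "http://example.com/2"))
    println(details)
    println("found ${found.size} new URLs")
}
```

Starting all threads before joining any of them is what makes the work overlap; joining inside the same loop that starts them would run the pages one after another.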

Serializing and deserializing objects

A serialization library called Jackson can be used, with some limitations.
The Java library alone does not seem to be enough; you need to load the Kotlin module separately.
By limitations I mean that I tried serializing and deserializing MutableMap<String, DataClass> and it didn't work.
MutableMap<String, String> works fine, so I'm not sure whether the nested structure is the problem or whether data classes are not supported.

Serialization example

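    import com.fasterxml.jackson.databind.ObjectMapper
    import com.fasterxml.jackson.module.kotlin.registerKotlinModule
    val mapper = ObjectMapper().registerKotlinModule()
    val serialized = mapper.writeValueAsString(url_details)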

Deserialization example

    import com.fasterxml.jackson.module.kotlin.readValue
    val mapper = ObjectMapper().registerKotlinModule()
    val url_details = mapper.readValue<MutableMap<String, String>>(json)

Try scraping

First, clone the repository.

$ git clone https://github.com/GINK03/kotlin-phantomjs-selenium-jsoup-parser.git

Breadth-first search

So far, two types of scraping have been implemented. The first simply crawls to a depth of 100 using breadth-first search, without evaluating JavaScript.
(Since I was using it to scrape my own site, I didn't set any particular limits. The default is 50 or more parallel accesses, so please adjust accordingly.)

$ sh run.scraper.sh widthSearch ${yourOwnSite}
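
To illustrate what a depth-limited breadth-first crawl does, here is a minimal single-threaded sketch over an in-memory link graph. It is not the repository's implementation: the function bfs, the links parameter (which stands in for fetching a page and extracting its URLs), and the toy site are all assumptions made for this example.

```kotlin
// Breadth-first traversal limited to maxDepth levels.
// Returns every reached URL together with the depth at which it was first seen.
fun bfs(start: String, maxDepth: Int, links: (String) -> List<String>): Map<String, Int> {
    val depthOf = linkedMapOf(start to 0)
    val queue = ArrayDeque(listOf(start))
    while (queue.isNotEmpty()) {
        val url = queue.removeFirst()          // oldest entry first = breadth-first
        val depth = depthOf.getValue(url)
        if (depth >= maxDepth) continue        // do not expand pages at the limit
        for (next in links(url)) {
            if (next !in depthOf) {            // skip URLs that were already seen
                depthOf[next] = depth + 1
                queue.addLast(next)
            }
        }
    }
    return depthOf
}

fun main() {
    // toy site: page -> links found on that page
    val site = mapOf(
        "/" to listOf("/a", "/b"),
        "/a" to listOf("/c", "/"),
        "/b" to listOf("/c"),
        "/c" to listOf("/d")
    )
    println(bfs("/", 2) { site[it] ?: emptyList() })   // prints {/=0, /a=1, /b=1, /c=2}
}
```

With a depth limit of 2, "/d" is never reached because "/c" sits at the limit and is not expanded.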

Image search

This mode searches on the Microsoft Bing image search screen. It is experimental code to see whether content drawn with Ajax can be obtained without using the API, so please do not use it for large numbers of requests that would cause trouble.
Please refer to the kancolle.txt file on GitHub for an example search list.

$ sh run.scraper.sh image ${searchQueryList} ${outputDirectory}


Additional Information

Version 1.0.0
Type Other categories
Update Time 2025-01-11
size 50MB
From Github
