
Performance



Parsing and HTML generation are rather fast if you don't check the validity of external links.

For example, on my PC it takes less than 7 seconds to parse this wiki and perform the generation without checking the external links[1].

Overview

Main Article: Algorithm

The parsing and writing algorithm has the following steps:
  • Pre-parsing identifies the list of files to parse and their types
  • Parsing parses each file identified in the previous step. By default the parsing is performed in the main Thread, but it is possible to perform it in background threads
  • Resolving resolves the inter-wiki links
  • Checking links checks the validity of the external links. This step is performed in background threads at the same time as the resolving
  • Writing writes the content of the wiki
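
The ordering above, with link checking running in the background at the same time as the resolving, can be sketched roughly as follows. This is a minimal illustration only: all class and method names here are hypothetical, not the actual docJGenerator API.

```java
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class PipelineSketch {
   static List<String> preParse() { return List.of("index.html", "syntax.html"); }
   static void parse(String file) { /* parse one wiki file */ }
   static void resolveLinks() { /* resolve the inter-wiki links */ }
   static void checkExternalLink(String url) { /* validate one external URL */ }
   static void writeSite() { /* write the generated HTML content */ }

   public static void main(String[] args) throws InterruptedException {
      // Step 1: pre-parsing identifies the files to process
      List<String> files = preParse();
      // Step 2: parsing, in the main thread by default
      for (String file : files) {
         parse(file);
      }
      // Step 4 starts in background threads...
      ExecutorService checker = Executors.newFixedThreadPool(30);
      checker.submit(() -> checkExternalLink("http://example.com/page.html"));
      // ...while step 3 (resolving) runs concurrently in the main thread
      resolveLinks();
      // Wait for the background link checks to finish before writing
      checker.shutdown();
      checker.awaitTermination(10, TimeUnit.SECONDS);
      // Step 5: writing
      writeSite();
      System.out.println("done");
   }
}
```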

Speeding-up the parsing

There are three sets of options which control the speed of the parsing:
  • Options which control the way link checking is performed. Link checking is performed in a background thread pool
  • Options which allow the parsing to be performed in background threads
  • Options which allow the parsing and the generation to be performed in the incremental generation mode


Several options configure the way the link-checking background threads are spawned and how link validity is checked:
  • checkHTTPLinksTimeOut: sets the timeout in seconds of the Thread pool used for checking URL links with the "http" protocol, if the check is performed in background Threads. The default is 10 seconds
  • checkHTTPLinksPool: sets the number of Threads in the Thread pool used for checking URL links with the "http" protocol. The default is 30. See the checkHTTPLinksPool parameter for more information on how changing this parameter can reduce performance
  • defaultHTTPTimeout: sets the timeout for checking the availability of an http link (500 ms by default)
  • timeouts: specifies the XML file setting specific timeouts for URLs (useful only if the "checkLinks" property is set to true)
By default the parsing is performed in the main Thread, but it is possible to perform the parsing in background threads[2]:
  • "forkParser": specifies if the parsing will be performed in background threads
  • "forkParserSplit": specifies the maximum number of files to parse per thread (useful only if "forkParser" is set to true)

CheckHTTPLinksPool parameter

Main Article: checkHTTPLinksPool

The checkHTTPLinksPool parameter sets the number of Threads in the Thread pool used for checking URL links with the "http" protocol. The default value is 30.

Each spawned thread will check all the links for one base URL[3], to avoid checking the existence of the same HTTP URL more than once.
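
As a rough illustration (with hypothetical names, not the actual docJGenerator code), the base URL can be derived by stripping the ref part, and one task per base URL submitted to the pool:

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class LinkPoolSketch {
   /** Strips the ref part: "http://my/file.html#title" -> "http://my/file.html". */
   static String baseURL(String url) {
      int sharp = url.indexOf('#');
      return sharp < 0 ? url : url.substring(0, sharp);
   }

   public static void main(String[] args) throws InterruptedException {
      List<String> links = List.of(
         "http://my/file.html#title",
         "http://my/file.html#other",
         "http://my/second.html");
      // Group the links by base URL so each URL is only checked once
      Map<String, Boolean> toCheck = new ConcurrentHashMap<>();
      for (String link : links) {
         toCheck.put(baseURL(link), Boolean.FALSE);
      }
      // checkHTTPLinksPool: the size of the pool (30 by default)
      ExecutorService pool = Executors.newFixedThreadPool(30);
      for (String base : toCheck.keySet()) {
         // A stand-in for the real HTTP availability check
         pool.submit(() -> toCheck.put(base, Boolean.TRUE));
      }
      pool.shutdown();
      // checkHTTPLinksTimeOut: the timeout of the pool (10 seconds by default)
      pool.awaitTermination(10, TimeUnit.SECONDS);
      System.out.println(toCheck.size() + " base URLs checked"); // 2 base URLs checked
   }
}
```

Here the two links pointing at "http://my/file.html" collapse to a single base URL, so only two tasks are submitted.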

The validation of external HTTP links can take more time (depending on the number of links you want to check), so expect the parsing time to increase if the checkHTTPLinks property is set to true (which is the default). For example, for this wiki the time for the parsing and generation increases to 6 seconds if external links are checked.

Note that if you drastically reduce the number of Threads, you will reduce the performance if there are a lot of links to check[4].

For example, if you set this parameter to 1, you will have:
      Parsed 122 articles (134 files) in 6,8 seconds
      Including Pre-parsed in 0,3 seconds
      Resolved in 10,3 seconds
      Performing http links checks in 9,8 seconds
      Writing site in 0,6 seconds
      Completed generation in 17,8 seconds
rather than this result with the default value:
      Parsed 122 articles (134 files) in 7,4 seconds
      Including Pre-parsed in 0,3 seconds
      Resolved in 2,4 seconds
      Performing http links checks in 2 seconds
      Writing site in 0,5 seconds
      Completed generation in 10,3 seconds

Example

Note that by default links will be checked in background threads with the following options:
  • "checkHTTPLinksTimeOut" is set to 10 seconds
  • "checkHTTPLinksPool" is set to 30
  • "defaultHTTPTimeout" is set to 300 ms


The following example overrides some of these parameters:
   <java classname="org.docgene.main.DocGenerator">
      <arg value="-input=wiki/input"/>
      <arg value="-output=wiki/output"/>
      <arg value="-search=titles"/>
      <arg value="-checkHTTPLinksTimeOut=7"/>
      <arg value="-checkHTTPLinksPool=50"/>
      <classpath>
         <pathelement path="docGenerator.jar"/>
      </classpath>
   </java>

ForkParser parameter

The forkParser parameter allows the parsing to be performed in background threads[2]. The forkParserSplit parameter specifies in this case how many files will be parsed in one Thread (the default is 20).

Setting this option to true can reduce the parsing time by up to 50%.
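
A minimal sketch of this fork-join splitting could look like the following; the names are hypothetical and the real parser is of course more involved:

```java
import java.util.List;
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveAction;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

public class ForkParserSketch {
   static final AtomicInteger PARSED = new AtomicInteger();
   // forkParserSplit: the maximum number of files parsed by one task (20 by default)
   static final int SPLIT = 20;

   static class ParseTask extends RecursiveAction {
      private final List<String> files;
      ParseTask(List<String> files) { this.files = files; }

      @Override
      protected void compute() {
         if (files.size() <= SPLIT) {
            // Small enough: parse the files directly in this thread
            files.forEach(f -> PARSED.incrementAndGet());
         } else {
            // Too many files: split the work in two and fork (the "map" step)
            int mid = files.size() / 2;
            invokeAll(new ParseTask(files.subList(0, mid)),
                      new ParseTask(files.subList(mid, files.size())));
         }
      }
   }

   public static void main(String[] args) {
      List<String> files = IntStream.range(0, 134)
            .mapToObj(i -> "article" + i + ".html")
            .collect(Collectors.toList());
      new ForkJoinPool().invoke(new ParseTask(files));
      System.out.println("Parsed " + PARSED.get() + " files"); // Parsed 134 files
   }
}
```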

For example if you set this option to true, you will have:
      Parsed 122 articles (134 files) in 3,9 seconds
      Including Pre-parsed in 0,3 seconds
      Resolved in 0,3 seconds
      Performing http links checks in 0,3 seconds
      Writing site in 2,4 seconds
      Completed generation in 6,7 seconds
rather than the default:
      Parsed 122 articles (134 files) in 6,5 seconds
      Including Pre-parsed in 0,3 seconds
      Resolved in 2,4 seconds
      Performing http links checks in 2,1 seconds
      Writing site in 0,5 seconds
      Completed generation in 9,6 seconds

Example

The following example enables the parsing in background threads:
   <java classname="org.docgene.main.DocGenerator">
      <arg value="-input=wiki/input"/>
      <arg value="-output=wiki/output"/>
      <arg value="-search=titles"/>
      <arg value="-forkParser=true"/>
      <classpath>
         <pathelement path="docGenerator.jar"/>
      </classpath>
   </java>

Incremental generation mode

Main Article: Incremental generation

It is possible to perform the generation in the incremental generation mode rather than parsing and generating all the content of the wiki. You need to set the "updateMode" command-line argument or configuration property.

The incremental generation mode can dramatically decrease the time necessary to perform the generation.

Show detailed statistics about the performance

The -showDetailedGenerationTimes command-line option shows detailed statistics about the generation time. For example:
      java -jar docGenerator.jar -input=wiki/input -output=wiki/output -showDetailedGenerationTimes=true
By default, you will see for example on the console:
      Generated in 10,3 seconds
      Generated wiki from D:\Java\docGenerator\code\wiki\input to D:\Java\docGenerator\code\wiki\output
With this option enabled, you will see for example on the console:
      Parsed 122 articles (134 files) in 7,4 seconds
      Including Pre-parsed in 0,3 seconds
      Resolved in 2,4 seconds
      Performing http links checks in 2 seconds
      Writing site in 0,5 seconds
      Completed generation in 10,3 seconds
      Generated wiki from D:\Java\docGenerator\code\wiki\input to D:\Java\docGenerator\code\wiki\output

PageRank

Main Article: PageRank

When the PageRank algorithm is used, the console will show the number of iterations of the algorithm and the time it spent. For example:
      Parsed 182 articles (199 files) in 4,8 seconds
      Including Pre-parsed in 0,3 seconds
      Resolved in 0,1 seconds
      Including processing PageRank in 2 milliseconds for 6 iterations
      Writing site in 1 seconds
      Completed generation in 5,9 seconds
      Generated wiki from L:\WRK\Java\docgenerator\wiki\input to L:\WRK\Java\docgenerator\wiki\output

Running the generation several times

The parsing speed will increase significantly if you perform the parsing several times and use the same GUI interface, most likely because the JVM has already warmed up after the first run.

For example, on my PC, the first time with the default options, I have the following result:
      Parsed 122 articles (134 files) in 5,8 seconds
      Including Pre-parsed in 0,3 seconds
      Resolved in 0,3 seconds
      Performing http links checks in 0,3 seconds
      Writing site in 2,1 seconds
      Completed generation in 8,3 seconds
but the second time I have:
      Parsed 122 articles (134 files) in 2,3 seconds
      Including Pre-parsed in 97 milliseconds
      Resolved in 69 milliseconds
      Performing http links checks in 53 milliseconds
      Writing site in 1,3 seconds
      Completed generation in 3,7 seconds
and the third time:
      Parsed 122 articles (134 files) in 1,5 seconds
      Including Pre-parsed in 69 milliseconds
      Resolved in 18 milliseconds
      Performing http links checks in 0 milliseconds
      Writing site in 1,3 seconds
      Completed generation in 2,8 seconds

Notes

  1. ^ Note that the parsing speed will increase significantly if you perform the parsing several times and use the same GUI interface, see running the generation several times

  2. ^ technically in a fork-join pool, using a kind of MapReduce algorithm
  3. ^ The link URL without the ref part. For example, the base URL for "http://my/file.html#title" would be "http://my/file.html"
  4. ^ This will only be the case of course if the checkHTTPLinks parameter is not set to false

Categories: configuration | general

docJGenerator Copyright (c) 2016-2023 Herve Girod. All rights reserved.