Converter : Wikipedia_to_Hdoc
===============

Licence :
---------------
GPL 3.0
http://www.gnu.org/licenses/gpl-3.0.txt

Credits :
---------------
Carrel Billiard Harold
Harriga Merouane
Lhomme Nicolas
Previous developers

Presentation
---------------
This converter transforms a Wikipedia page (from a link or a saved page) into an Hdoc document.

User Documentation
---------------
Open a terminal and go to the root of the folder (Wikipedia_to_hdoc).

Generating .hdoc of a Wikipedia article with an URL
---------------------------------------------------

1 - Run the command corresponding to your OS

On Windows :
    runURL.bat yourWikipediaUrl yourFilename
    yourWikipediaUrl is the Wikipedia URL
    yourFilename is the name of the directory in which output files will be placed
    For instance :
    runURL.bat https://fr.wikipedia.org/wiki/Constructeur_(programmation) constructeur

On Linux :
    sh runURL.sh yourWikipediaUrl yourFilename
    yourWikipediaUrl is the Wikipedia URL
    yourFilename is the name of the directory in which output files will be placed
    For instance (the URL is quoted because the shell would otherwise reject the parentheses) :
    sh runURL.sh "https://fr.wikipedia.org/wiki/Constructeur_(programmation)" constructeur

2 - Get the .hdoc in the output/yourFilename folder

Generating .hdoc of a Wikipedia article with a local file
---------------------------------------------------------

1 - Copy the content of the Wikipedia article you want to convert into a file called "source.xml" in the directory named "input".

2 - Run the command corresponding to your OS

On Windows :
    runFile.bat

On Linux :
    sh runFile.sh

3 - Get the .hdoc in the output/source folder

To do
---------------------------------------------------------

Concerning images :
1 - Extract the metadata information from the meta.xml file for each image. You can do that by creating an XSL file that will be called from the ant task generated by xslt/get_ressources_urls.xsl. In that file you have access to each meta.xml file.
2 - Verify that images are correctly zipped, to avoid any problem while testing in Opale.
3 - Images inside paragraphs break the validation against the Hdoc schema. Make a proposal to change the schema and handle that case.

Concerning listings :
1 - Detect the programming language of the code excerpts in the Wikipedia article.

Concerning tables :
1 - Solve the encoding problem.
2 - Change the Hdoc schema in order to accept images in tables?
3 - Display complex tables as tables in Opale (not as external files).

Be aware of the following things
---------------------------------------------------------
1 - Not all images have metadata information (only the ones who )
2 - The titles of the images have metadata information (only the ones that are not included in the text)

BUG
---
1 - The Linux sh files don't work behind the UTC proxy, but they work outside of UTC.

2 - Random errors might occur
Wikipedia is a great tool : everyone can participate. However, it does not provide contributors with best practices that everyone follows. The result is a lot of different ways to write articles. This is why this converter might not handle some situations (even if all the files I have tried worked), and in its current state it might not be able to convert some Wikipedia articles.

3 - Small issues with Opale
Links can be invisible if you use an old version of Opale. This problem does not come from the Wikipedia to Hdoc converter. Make sure you use an up-to-date version of Opale to test your scar archives.
Opale might also indicate that the scar file contains errors once imported. These "errors" are actually warnings. The archives work, as they were validated when the scar file was made. These warnings come from Opale, and you can ignore them.
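
Converting several articles in one go
---------------------------------------------------------
The URL workflow described in the user documentation handles one article per call. As an illustration only, the sketch below wraps it in a loop. The function names (name_from_url, convert_all) and the RUN_URL variable are hypothetical and not part of the converter; the only thing taken from this README is that runURL.sh expects a URL followed by an output directory name. It assumes the URL ends with the article title (no trailing slash or query string).

```shell
#!/bin/sh
# Hypothetical helpers, not shipped with the converter.

# Derive an output directory name from a Wikipedia URL:
# keep what follows the last "/", then replace every character
# that is not a letter, a digit or "_" with "_".
name_from_url() {
    printf '%s\n' "${1##*/}" | tr -c 'A-Za-z0-9_\n' '_'
}

# Run the converter once per URL given as argument.
# RUN_URL is a hypothetical override (for example RUN_URL=echo for a dry run);
# by default the documented Linux command "sh runURL.sh" is used.
convert_all() {
    for url in "$@"; do
        name=$(name_from_url "$url")
        # Stop at the first failing conversion.
        ${RUN_URL:-sh runURL.sh} "$url" "$name" || return 1
    done
}
```

Possible usage from the root of the folder, after loading the functions into the current shell (quote each URL so that the shell does not interpret the parentheses) :
    convert_all "https://fr.wikipedia.org/wiki/Constructeur_(programmation)"
Each .hdoc would then be expected in the output folder named after the derived directory name.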