Converter : Wikipedia_to_Hdoc
---------------

Licence : GPL 3.0
---------------

Credits :
---------------
Carrel Billiard Harold

Harriga Merouane

Lhomme Nicolas



Getting started 
---------------

Open a terminal and go to the root of the project folder (Wikipedia_to_hdoc).

Generating the .hdoc of a Wikipedia article from a URL
------------------------------------------------------

1 - Run the command corresponding to your OS
        
        On Windows : 
            runURL.bat yourWikipediaUrl yourFilename
                yourWikipediaUrl is the Wikipedia URL
                yourFilename is the name of the directory in which output files will be placed
                
            For instance : runURL.bat https://fr.wikipedia.org/wiki/Constructeur_(programmation) constructeur
        
        On Linux : 
            sh runURL.sh yourWikipediaUrl yourFilename
                yourWikipediaUrl is the Wikipedia URL
                yourFilename is the name of the directory in which output files will be placed
            
            For instance : sh runURL.sh "https://fr.wikipedia.org/wiki/Constructeur_(programmation)" constructeur
            
2 - Retrieve the .hdoc file from the output/yourFilename folder
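
The two steps above can also be driven from a small script. The following is a minimal Python sketch and not part of the project: `build_command` and `convert` are hypothetical helper names, and it assumes that `runURL.bat` and `runURL.sh` take exactly the two arguments described above and are launched from the project root. On Linux, passing the arguments as a list means the parentheses in the URL need no quoting.

```python
import platform
import subprocess
from pathlib import Path


def build_command(url, name, system=None):
    """Return the argument list for the launcher script of the given OS."""
    system = system or platform.system()
    if system == "Windows":
        # cmd /c runs the batch file shipped for Windows.
        return ["cmd", "/c", "runURL.bat", url, name]
    # Any other OS is assumed to use the sh script.
    return ["sh", "runURL.sh", url, name]


def convert(url, name, root="."):
    """Run the converter from the project root and return the output folder."""
    subprocess.run(build_command(url, name), cwd=root, check=True)
    return Path(root) / "output" / name
```

For example, `convert("https://fr.wikipedia.org/wiki/Constructeur_(programmation)", "constructeur")` would run the script for the current OS and return the path `output/constructeur`.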

Generating the .hdoc of a Wikipedia article from a local file
-------------------------------------------------------------

1 - Copy the content of the Wikipedia article you want to convert into a file named "source.xml" in the directory named "input".

2 - Run the command corresponding to your OS
        
        
        On Windows : 
            runFile.bat
        
        On Linux : 
            sh runFile.sh
                       
3 - Retrieve the .hdoc file from the output/source folder
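
Step 1 above can be scripted when the article has already been saved locally. The following is a minimal Python sketch under that assumption: `stage_source` is a hypothetical helper name, not part of the project, and it only copies the file into place; it does not check that the content is in the format the converter expects.

```python
import shutil
from pathlib import Path


def stage_source(article_file, root="."):
    """Copy a locally saved article to input/source.xml under the project root."""
    target = Path(root) / "input" / "source.xml"
    # Create the input directory if it is missing.
    target.parent.mkdir(parents=True, exist_ok=True)
    shutil.copyfile(article_file, target)
    return target
```

After staging the file, run `runFile.bat` or `sh runFile.sh` as described in step 2.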

To do
---------------------------------------------------------
Concerning images :

1 - Extract the metadata from the meta.xml file of each image. You can do that by creating an XSL file that is called from the ant task generated by xslt/get_ressources_urls.xsl. From that file you have access to each meta.xml file.

2 - Verify that images are correctly zipped, to avoid problems when testing in Opale

3 - Images inside paragraphs break validation against the hdoc schema. Propose a change to the schema to handle this case.


Concerning listings :

1 - Detect the programming language of the code excerpts in the Wikipedia article


Concerning tables : 

1 - Solve the encoding problem

2 - Change the hdoc schema to accept images in tables?

3 - Display complex tables as tables in Opale (not as external files)

 
Be aware of the following things
---------------------------------------------------------
1 - Not all images have metadata information (only the ones who )
2 - The titles of the images have metadata information (only the ones that are not included in the text)

BUG
---
1 - The Linux sh files do not work behind the UTC proxy, but work outside UTC