Kenneth Rohde Christiansen

Saturday, October 22, 2005

Uh... what a boring Saturday

So, my plan was to relax today and sleep in. Unfortunately, I woke up quite early considering that I went out last night. I wasn't really supposed to, but when your friends invite you :) you don't say no, do you?

So, I also didn't really get to relax. I dunno... but it is quite hard for me to relax when I know that there are lots of things that I need to finish. Instead of doing something school-related, I did some work-related stuff. I am currently working on putting a dictionary online, but I cannot say much more than that.

Anyway, I first got samples in one format and developed some tools to transform them into a structured format (XML), plus another tool to dump the data into the database. But the authors only have the final version as a PDF :( Well, there is an option to export to XML, but unfortunately it gives me text filled with weird spaces, weird letters and randomly placed XML tags. Scheisse....

The file is around 6-7 MB, so I didn't really feel like doing this manually :) and instead spent today (and most of yesterday) figuring out the different structures of entries and developing some regular expression rules that, together with some Perl code, can convert the file. It seems to work quite well, but in some places it has gone totally wrong (leftovers from the PDF->XML conversion) and I have to clean that up manually; bummer!
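
To give a rough idea of what such rules look like, here is a minimal sketch. It is in Python rather than the Perl I actually used, and since I cannot show the real data, the entry layout ("headword (part of speech) definition"), the element names and the function names are all made up for illustration.

```python
import re
from xml.sax.saxutils import escape

# Assumption: stray tags look like ordinary XML tags dropped into the text.
STRAY_TAG = re.compile(r"</?[A-Za-z][^<>]*>")
# On Python 3 strings, \s also matches Unicode whitespace such as no-break spaces.
ODD_SPACE = re.compile(r"\s+")

# Hypothetical entry layout: "headword (part of speech) definition".
ENTRY = re.compile(
    r"^(?P<head>[^()]+?)\s*\((?P<pos>[^()]+)\)\s*(?P<definition>.+)$"
)


def clean_line(line):
    """Drop stray tags, then collapse every run of whitespace to one space."""
    line = STRAY_TAG.sub("", line)
    return ODD_SPACE.sub(" ", line).strip()


def convert_line(line):
    """Return an <entry> element as a string, or None if no rule matches."""
    match = ENTRY.match(clean_line(line))
    if match is None:
        # Lines like this are the ones that need manual cleanup.
        return None
    head, pos, definition = (
        escape(match.group(name)) for name in ("head", "pos", "definition")
    )
    return f"<entry><head>{head}</head><pos>{pos}</pos><def>{definition}</def></entry>"
```

Returning None instead of guessing means the lines that no rule matches can be collected in a separate list and fixed by hand afterwards.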

Complicated regular expressions can give you a hard time, but man, they save you a lot of time as well :) Doing this work manually would take me around a year.


