Akamai Security Intelligence & Threat Research

InfoSec experiment - Letting the CAT out of the bag

By Lukasz Orzechowski

If you work on an Information Security team that gets customer questionnaires, you're likely familiar with Vendor Security Risk Assessment templates. We all care about information safety, and it is natural for our customers to want to check how well we align with what they require internally, or with industry standards. We get a lot of questions, and addressing them is our bread and butter.

One day, however, somewhere between a question about access control and another about password policy, I had an idea - isn't the process of working through questions and writing answers similar to what professional translators do when they render one language into another?

We basically translate questions into answers. We see a lot of very similar requests for clarification in different questionnaires, just like translators see a lot of similar sentences in different texts. If the two are this similar, could we perhaps use a Computer Aided Translation (CAT) tool to speed up the whole process?

I tested this hypothesis, and I want to share my experience with you.

Computer aided translation

The history of tools that assist with the translation process can be traced back to 1983, when ALPS (Automatic Language Processing System) was released. The underlying idea is that old translations of similar sentences can be reused for new translations to speed up the process and make the translation more consistent.

Computer Assisted Translation tools (or CAT tools) split source text into segments, usually based on sentences. Each translated segment is placed in a Translation Memory (TM) for a specific language pair. When the translator opens a new segment, the CAT tool searches for similar source texts in the translation memory, and highlights the differences between the current segment and the best match found in the TM.

The translator can then just copy the translation and make the necessary adjustments.
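As a rough illustration of that lookup, here is a minimal Python sketch. It uses the standard difflib module rather than the matching algorithm of any real CAT tool, and the two-entry memory and the function names are invented for the example.

```python
import difflib

# Hypothetical translation memory: stored source segment -> stored translation.
TM = {
    "The cat sits on the mat.": "Kot siedzi na macie.",
    "The dog sleeps in the garden.": "Pies śpi w ogrodzie.",
}


def best_match(segment, memory):
    """Return (source, translation, score) for the most similar stored source."""
    scored = [
        (difflib.SequenceMatcher(None, segment, source).ratio(), source)
        for source in memory
    ]
    score, source = max(scored)
    return source, memory[source], score


def word_diff(old, new):
    """List the words removed from ('- ') or added to ('+ ') the stored source."""
    diff = difflib.ndiff(old.split(), new.split())
    return [line for line in diff if line.startswith(("- ", "+ "))]


new_segment = "The cat sits on the sofa."
source, translation, score = best_match(new_segment, TM)
print(source, "->", translation, f"(score {score:.2f})")
print(word_diff(source, new_segment))
```

The translator would start from the returned translation and only change the words flagged in the diff.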

CAT tools made a big difference in the translation world. Every translation agency uses them, and before documents are sent to employees for further processing, they are usually pre-translated using corporate TMs.

Preparations

To start the experiment, I had to acquire two things. First, I needed an actual CAT tool. Then, to find matches and see if this whole idea was sound, I had to create my own translation memory with questions and answers.

The first one was easy. I wanted something open source that I could tweak to my needs if required, and also something free, as it was just an experiment. After some research, I decided to bet on OmegaT - the free translation memory tool. It met all the criteria and was also cross-platform, so we could use it on a Mac, Windows, or Linux if desired.

Figure 1. OmegaT main window

Creating the Translation Memory was more challenging. I exported about 2,000 questions and answers from our knowledge base, but a lot of them were multi-sentence.

Why is this a problem? Well, in most TMs, we want each stored object to be the smallest possible pair of source text and translation. The small size makes it optimal for reuse. Aligning source and translation sentences in a one-to-one relation is also typically rather easy.

In my case, I wanted this object to have the whole question and the whole answer stored. However, the questions are usually shorter than answers, and answers typically have many sentences. One-to-one pairing would not make a lot of sense in this situation. I needed a tool that would take the input row by row, instead of sentence by sentence. And again - something free to use and open source, if possible.

Surprisingly, many popular free tools for TM management failed miserably. They either lacked the required options, couldn't handle the size of the file, or produced Translation Memories with integrity problems.

This gave me an unlikely winner - Heartsome TMX Editor. It had its quirks, and required adding language information in the source file, but afterwards, it worked perfectly and created the TM I needed.
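To make the row-by-row idea concrete, here is a minimal Python sketch that writes each question and its whole answer as one translation unit in a TMX file. It is not how Heartsome TMX Editor works internally. The function name, the tool name in the header, and the two language codes are placeholders chosen for the example, and I have not verified that a file produced this way loads cleanly in any particular tool.

```python
import xml.etree.ElementTree as ET


def build_tmx(rows, src_lang="en-US", tgt_lang="en-GB"):
    """Build a TMX 1.4 document with one <tu> per (question, answer) row.

    The language codes are placeholders: questions and answers are both in
    English, but a translation memory still needs a language pair.
    """
    tmx = ET.Element("tmx", version="1.4")
    ET.SubElement(tmx, "header", {
        "creationtool": "qa-to-tmx-sketch",   # hypothetical tool name
        "creationtoolversion": "0.1",
        "segtype": "paragraph",               # whole rows, not single sentences
        "o-tmf": "csv",
        "adminlang": "en-US",
        "srclang": src_lang,
        "datatype": "plaintext",
    })
    body = ET.SubElement(tmx, "body")
    for question, answer in rows:
        tu = ET.SubElement(body, "tu")
        for lang, text in ((src_lang, question), (tgt_lang, answer)):
            tuv = ET.SubElement(tu, "tuv", {"xml:lang": lang})
            ET.SubElement(tuv, "seg").text = text
    return ET.tostring(tmx, encoding="unicode")


# Made-up sample row: a one-sentence question with a multi-sentence answer.
rows = [
    ("Do you enforce a password policy?",
     "Yes. Minimum length is enforced. Passwords are rotated regularly."),
]
print(build_tmx(rows))
```

Because the multi-sentence answer stays inside a single seg element, the question and its full answer travel together as one object.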

Figure 2. Heartsome TMX Editor

I had both the CAT tool and the TM file at that point. It was time to check how they worked together.

Practice run

For the practice run, I had to use an actual questionnaire.

However, the moment I started deciding which one to use, it became obvious that while questionnaires share the same principle, they don't share the same format. Some of them are web based, some are in spreadsheets, some in .doc files. There were questions based on industry frameworks, others in narrative form, and some expecting Yes / No / N/A as an answer. There were questionnaires with just "Question" and "Answer" columns; however, most of them had at least two levels, with additional context for each question.

It all meant that I couldn't just open a questionnaire in OmegaT and expect a good result. The tool would force me to "translate" questions, context, and default values all together. Something dreadful had to happen first - pre-processing.

Pre-processing

I had to prepare the questionnaire to be addressed. I can hear you saying "just create a copy of the questions column", but it wasn't that easy. Sometimes, there was no "questions" column. This was typical in document or narrative based questionnaires. Sometimes, even if the column was in place, the structure of the spreadsheet file prevented selecting questions in an easy way. For example, a lot of the questionnaires were split into sections, with some cells merged every few rows, and that ruins the convenient click-and-select process.

Manually preparing each document before "translation" was time consuming and created a big drag on the process. I continued for the sake of the experiment, but it was an important downside and a significant finding.

After the document was cleaned up, I proceeded further.

Figure 3. OmegaT main window during the translation.

And it worked. To an extent. Not really. But, I'm getting ahead of myself.

Source text segmentation

Remember the challenges with creating our TM file? Well, the same segmentation rules were applied to the source text with questions. OmegaT assumed that I was translating it, and split each question into individual sentences.

Obviously, in our use case, this wasn't needed or helpful - I couldn't align my answer with individual sentences of the question. I could, for example, add the answer to the first sentence and then skip the rest until I got to the next question, but then the "translation" would be added to the TM accordingly, and the next time I would start getting strange matches, or no matches at all.

Fortunately, segmentation rules can be customized, and I could adjust them to my expectations. Unfortunately, it was yet another step in the process and yet another delay.
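The difference between the two segmentation strategies can be sketched in a few lines of Python. This is a plain regular-expression analogue, not OmegaT's actual rule syntax, and the sample questionnaire text and function names are invented for the example.

```python
import re

# Break after sentence-ending punctuation followed by whitespace.
SENTENCE_BREAK = re.compile(r"(?<=[.?!])\s+")


def sentence_segments(text):
    """Default translator-style behaviour: one segment per sentence."""
    return [s for s in SENTENCE_BREAK.split(text.strip()) if s]


def row_segments(text):
    """What a questionnaire needs: one segment per row (whole question)."""
    return [line.strip() for line in text.splitlines() if line.strip()]


# Made-up questionnaire: two rows, each holding a two-sentence question.
questionnaire = (
    "Do you have a password policy? Please describe the minimum length.\n"
    "Is access reviewed regularly? If so, how often?\n"
)

print(len(sentence_segments(questionnaire)))  # 4 segments
print(len(row_segments(questionnaire)))       # 2 segments
```

With sentence-based rules the two questions turn into four segments to "translate"; with row-based rules each question stays whole and can be paired with a single answer.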

Fuzzy matches

When a translator opens a segment (a source fragment), the CAT tool automatically looks for similar sentences in the translation memory. Without going too deep into technical details, the tool finds sentences that use the same words in the same order. Each sentence it finds is given a score based on its similarity to the current one, and these sentences are then displayed in the Fuzzy matches window, ordered from the best match down.

It works great for translators. It even highlights differences between the current and the old segments. However, when the same approach is applied to questions, it doesn't help that much. That's because the same question can be phrased very differently, and questions that are very similar in the words they use can ask for very different things.
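A small Python sketch shows why word-order similarity misfires on questions. The scoring here uses difflib on word lists, which is only a stand-in for whatever a real CAT tool computes, and the three questions are invented for the example.

```python
import difflib
import re


def words(text):
    """Lowercase a sentence and split it into words, dropping punctuation."""
    return re.findall(r"[a-z0-9]+", text.lower())


def fuzzy_score(current, stored):
    """Similarity in [0, 1] based on shared words in the same order."""
    return difflib.SequenceMatcher(None, words(current), words(stored)).ratio()


def rank(current, stored_questions):
    """Return (score, question) pairs, best match first."""
    scored = [(fuzzy_score(current, q), q) for q in stored_questions]
    return sorted(scored, reverse=True)


current = "Do you encrypt data at rest?"
same_topic = "Is stored information protected with encryption?"
other_topic = "Do you encrypt data in transit?"

for score, question in rank(current, [same_topic, other_topic]):
    print(f"{score:.2f}  {question}")
```

The question about data in transit shares most of its words with the current one and ranks first, although it asks about something else. The rephrased question about the same topic shares no words at all and lands at the bottom.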

That's why, when we manually search the knowledge base for old answers, we don't try to paste in and search for the whole sentence. We search for "access control", "role separation" or "system name". Basically, we look for specific keywords that are likely to appear in the prior questions.

Another thing that we need to consider is that the TM search mechanism searches for best matches only in the source text (or in our case in questions). However, it is even more likely that we find our keyword in previous answers, and this part of the language pair is not searched at all.
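The gap can be shown with a short Python sketch that compares a source-only search with a search over both halves of each pair. The two knowledge base entries and the function names are made up for the example.

```python
# Hypothetical knowledge base entries: (question, answer) pairs.
KNOWLEDGE_BASE = [
    ("Do you have a password policy?",
     "Yes. Passwords are governed by our access control standard."),
    ("How is production access granted?",
     "Through role separation and manager approval."),
]


def source_only_search(keyword, entries):
    """Mimic a TM lookup: only the questions (source side) are searched."""
    needle = keyword.lower()
    return [(q, a) for q, a in entries if needle in q.lower()]


def keyword_search(keyword, entries):
    """Search both sides of each pair: questions and answers."""
    needle = keyword.lower()
    return [
        (q, a) for q, a in entries
        if needle in q.lower() or needle in a.lower()
    ]


print(source_only_search("access control", KNOWLEDGE_BASE))
print(keyword_search("access control", KNOWLEDGE_BASE))
```

In this sample, "access control" appears only in an answer, so the source-only search comes back empty while the search over both sides finds the entry.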

And this was not the last challenge.

Post-processing

Translators send their customers the translated text, not the language pairs. That's why OmegaT replaces the source text with the translation during the translation process. So, in our case, the output file was a file with answers. To have the questionnaire addressed, I had to open the file and copy/paste the answers into the proper place.

This added yet another step to the process: post-processing. Again, the document structure was the factor deciding whether it was time consuming or not. Anything other than a single column of answers added drag to the process, as I had to pay close attention to align each answer with the right question in the questionnaire.

Fortunately, that was the final step that concluded my experiment. I had done it. But was it worth the effort?

Conclusions

An experiment is "a scientific procedure undertaken to make a discovery, test a hypothesis, or demonstrate a known fact". It doesn't have to give us a definitive answer on how to address a problem to be useful.

Sometimes it gives us something else - a clear understanding of what we don't want to have, what the gaps are, and what is inconvenient. We can use it to better understand what we need.

In the case of this experiment, I learned that I don't want any pre-processing or post-processing. Segmentation is a great concept, but in our case, not very useful. We must search for keywords, not for similar sentences. We need to search both in questions and answers, not only in the source text.

I learned that what we need is basically a plugin that can operate on any document without changing its structure. The plugin should connect to the knowledge base, and allow us to search it after we select a word or a couple of words, or enter the text manually.

Likewise, the plugin should remember what we searched for as keywords for the future, and display search results every time we are processing a sentence that has any of these keywords. It should enable us to copy / paste prior answers.
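The core behaviour of such a plugin could be sketched roughly as follows. This is only an illustration of the idea in Python, not an existing tool: the class name, its methods, and the sample entries are all hypothetical, and a real plugin would also need the document integration described above.

```python
class KeywordAssistant:
    """Hypothetical helper: keyword search that remembers past keywords."""

    def __init__(self, entries):
        self.entries = list(entries)   # (question, answer) pairs
        self.remembered = set()        # keywords searched for so far

    def _find(self, needle):
        """Return entries whose question or answer contains the keyword."""
        return [
            (q, a) for q, a in self.entries
            if needle in q.lower() or needle in a.lower()
        ]

    def search(self, keyword):
        """Search questions and answers, and remember the keyword."""
        needle = keyword.strip().lower()
        if not needle:
            return []
        self.remembered.add(needle)
        return self._find(needle)

    def suggest(self, sentence):
        """Show prior results for every remembered keyword in the sentence."""
        text = sentence.lower()
        return {
            kw: self._find(kw)
            for kw in sorted(self.remembered)
            if kw in text
        }


# Made-up sample entry.
assistant = KeywordAssistant([
    ("Do you have a password policy?", "Yes, it is reviewed annually."),
])
assistant.search("password")
print(assistant.suggest("What is your password rotation period?"))
```

After one manual search for "password", any later sentence containing that word brings the earlier answer back without a new search.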

I guess I have defined my pet project for the next year.
