In the first two entries of this series, we saw the standard way ER for import works, but sometimes that’s just not enough. When we’re working with a file we want to parse, and we have, or think we could find, escape characters (for example, with CSVs or EDI messages), we must leverage the power of Regex in the format mapping.
In the last article, we found this file failed because our format import didn’t know that the comma in the third column should be escaped:
Let’s start by creating a new text import format, extending our model. In the format, let’s just parse the whole line. As we saw in the previous article, we’re going to give the ‘Seq’ node the special delimiter ‘New Line – CRLF’ and set the ‘Line’ multiplicity to ‘Zero many’.
Now, let’s select ‘Map format to model’, create a new mapping, and then select ‘Designer’.
What we want to do is to separate each non-escaped comma. The problem is, a comma is a very common character. So, our strategy will be to find all non-escaped commas and substitute them with a very unusual and unimportant symbol, then split them using that symbol. The symbol I’ve selected is §. First, let’s make sure that our lines don’t have that symbol, and if they do, let’s substitute it with a white space.
We’re going to create a new calculated field and use the replace formula. The first parameter is the text we want to substitute, the second is the substring we want to substitute, the third is the replacement value, and the final one is a boolean to indicate if we’re using Regex.
REPLACE(@.LineValue,"§"," ",false)
Now, let’s find all the non-escaped commas using Regex. Regex is a ‘language’ used to match patterns in a text. I’ve never been very good with it, but luckily now we have AI tools:
In ER formulas, we cannot use ” inside a text, but we can use CHAR(34):
REPLACE
(
@.LineValueSanitized,
",(?=(?:[^"&CHAR(34)&"]*"&CHAR(34)&"[^"&CHAR(34)&"]*"&CHAR(34)&")*[^"&CHAR(34)&"]*$)",
"§",
true
)
Now, let’s split all § and sanitize the field from unnecessary double quote:
1:
SPLIT(@.LineValueSubstituted, "§")
2:
IF
(
CONTAINS(@.Value,","),
REPLACE
(
@.Value,
"^"&CHAR(34)&"|"&CHAR(34)&"$",
"",
true
),
@.Value
)
Let’s find all fields using the formula INDEX. Additionally, for the last column, we should convert it to a date:
Field4: Calculated field = DATEVALUE(INDEX(@.LineFields, 4).ValueSanitized, "ddMMyyyy")
Field3: Calculated field = INDEX(@.LineFields, 3).ValueSanitized
Field2: Calculated field = INDEX(@.LineFields, 2).ValueSanitized
Field1: Calculated field = INDEX(@.LineFields, 1).ValueSanitized
Finally, let’s link the fields:
Let’s go back to the previous screen and test it with ‘Run’. If everything is correct, we can complete the format and select it as the default for model mapping.
Leave a Reply