Hi,
This can be accomplished without a loop as I see the data in your sample input file is highly structured. In summary, three things need to happen:
(1) Remove headers and footers between pages
(2) Extract data points
(3) Clean up dates
(1) Can be done using the Replace AC with the following regex expression, replacing the match with an empty string:
(?m)continues.{3}[.\s\w\W]*?continued.{3}
(2) Can be done using the Find AC with the following regex expression and with capture groups enabled:
(?m)(\d+)\s+(\d+)\s+(\d{2}/\d{2}/\d{4})\s+(\d+)\s+([\w\s]+?(?=\s{2}))\s+(\d+)\s+([\d,.]+)\s+(\w)\s+([\d.]+)\s+([\d.]+)\s+([\d.]+)\s+([\d.]+)\s+(\d+)\s+(\d+)\s+(\d{2}/\d{2}/\d{4})\s+([\w\s]+?(?=\s{2}))\s+(\d+)\s+(\d+)
(3) Can be done with any of the other ACs where you convert dd/mm/yyyy into yyyymmdd as per your sample output.
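For anyone who wants to try the three steps outside Kryon first, here is a minimal Python sketch of the same idea. It is only an illustration, not the contents of the LWIZ: the sample text, the names PAGE_BREAK, ROW, to_yyyymmdd and extract_rows are all made up for the example, and Python's regex engine may behave slightly differently from the one Studio uses.

```python
import re
from datetime import datetime

# Step 1 pattern: everything from a "continues..." footer to the next
# "continued..." header (same expression as in the post).
PAGE_BREAK = re.compile(r"(?m)continues.{3}[.\s\w\W]*?continued.{3}")

# Step 2 pattern, assembled from named pieces so it is easier to read.
# Joined with \s+ it is the same 18-group expression as in the post.
NUM = r"(\d+)"
DEC = r"([\d.]+)"
DATE = r"(\d{2}/\d{2}/\d{4})"
TEXT = r"([\w\s]+?(?=\s{2}))"  # free text, ended by two whitespace chars
ROW = re.compile(r"(?m)" + r"\s+".join([
    NUM, NUM, DATE, NUM, TEXT, NUM, r"([\d,.]+)", r"(\w)",
    DEC, DEC, DEC, DEC, NUM, NUM, DATE, TEXT, NUM, NUM,
]))


def to_yyyymmdd(value):
    """Step 3: convert dd/mm/yyyy into yyyymmdd."""
    return datetime.strptime(value, "%d/%m/%Y").strftime("%Y%m%d")


def extract_rows(raw_text):
    """Strip page headers/footers, then return one list of fields per row."""
    cleaned = PAGE_BREAK.sub("", raw_text)
    rows = []
    for match in ROW.finditer(cleaned):
        fields = list(match.groups())
        fields[2] = to_yyyymmdd(fields[2])    # first date column
        fields[14] = to_yyyymmdd(fields[14])  # second date column
        rows.append(fields)
    return rows


# Hypothetical input: two line items separated by a page break.
SAMPLE = (
    "1  1001  05/03/2021  42  STEEL PIPE  7  1,250.00  A  1.5  2.5  3.5  4.5"
    "  8  9  17/04/2021  ACME LTD  11  12\n"
    "continues...\nPage 1 of 2\nITEM  ORDER  DATE\ncontinued...\n"
    "2  1002  06/03/2021  43  COPPER WIRE  3  980.50  B  0.5  1.0  1.5  2.0"
    "  4  5  18/04/2021  BOLT CO  13  14"
)

rows = extract_rows(SAMPLE)
for row in rows:
    print(row)
```

No loop over pages is needed: the page-break text is removed in one substitution, and a single pass of the row pattern then picks up every line item.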
I've attached an LWIZ which you can import into your Studio, a screenshot of the code, and a sample output file produced by the code. (For ease of reading I've not included the date reformatting code, only the regex steps.)
Final notes:
- Your own sample output file has more line items than your sample input file, therefore the sample output I posted will have fewer rows than yours.
- The final line item in your sample input is truncated, therefore this line is not included in the sample output. However, in real life it wouldn't be truncated, so I do not see this being a problem in a production scenario.
- Please test the regex expressions on a wider variety of data. Using the limited set provided I cannot be sure whether single-digit fields, for example, are a maximum of 1 char or can be more. Adjust the regex expression accordingly to ensure the field data is captured.
- You can visually test out the expressions on https://regex101.com.
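To illustrate the point about field widths, here is a small Python sketch (hypothetical sample lines and shortened patterns, not the full expression above) showing how a single-character group such as (\w) silently drops a row when the field turns out to be wider, and how (\w+) avoids that:

```python
import re

# Shortened, illustrative patterns: a number, a code field, a number.
NARROW = re.compile(r"(\d+)\s+(\w)\s+(\d+)")   # code is exactly 1 char
WIDE = re.compile(r"(\d+)\s+(\w+)\s+(\d+)")    # code is 1 or more chars

# Hypothetical lines: the second has a two-character code.
samples = ["7  A  3", "7  AB  3"]


def unmatched(pattern, lines):
    """Return the lines the pattern fails to match at all."""
    return [line for line in lines if pattern.search(line) is None]


missed_by_narrow = unmatched(NARROW, samples)
missed_by_wide = unmatched(WIDE, samples)
print("missed by narrow pattern:", missed_by_narrow)
print("missed by wide pattern:", missed_by_wide)
```

Running a check like this over a larger batch of real input is a quick way to find rows that an over-strict group is skipping.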
I hope this helps, don't hesitate to contact me if you need assistance!
Thanks,
Darren
Hi Darren, this is one of the best regex solutions I have ever received. Thank you so much, dear Darren. Let me know if there is any portal or channel where I can rate your solution.
It's working perfectly on https://regex101.com/, but I got a slight mismatch in the Excel output, so I will try the code from my end too.
Meanwhile, it would be helpful if you could let me know how to import the attached .lwiz file.
I am also attaching the original text extracted from the PDF, along with the expected output. It would be very helpful if you could assist me in aligning this text file with the attached Excel output.
I will apply the same kind of logic to other templates too, as we have more formats.
Thank you so much dear Darren.
You're welcome.
As far as I can tell, there are only a couple of differences between my output and yours:
- dates are in the wrong format
- headings are missing
Both are easily addressed.
Did you see any other discrepancy not in the above? What was the "slight mismatch" you mentioned?
To import an LWIZ file you need to create a new wizard, open the Kryon Studio editor and in the menu click Wizard > Open Local File, then select the LWIZ file which will then import all the steps.
If you cannot open the LWIZ, you may be using an older version of Kryon than I am (I am using v20.3). Let me know and I will send a version for v19.1 (the trial version of Studio).
Thanks,
Darren
Please see attached a new version of the LWIZ. This should output the data exactly as per your output example, complete with headers and formatted dates.
Dear Darren, the code is working flawlessly on my machine. I am also using 20.3. Thank you so much for your valuable time and help. Your regex knowledge is A1-level. I really appreciate it. I have also started following you on the forum and I will go through all your valuable posts.
I am going to apply this solution to a few other templates as well. If I get stuck, I will try on my own first, and if the problem still persists, I will reach out to you, dear Darren.
No problem 🙂 There is plenty to find on KryoNet, especially inside the Knowledgebase.
If you have time, why not sign up for the Kryon Bot Camp (hackathon) at the top of the page? I'm sure there will be many interesting submissions to learn from.
Sure, Darren. Registered already.
Completed 2 bots. One extracts handwritten cheque data (including Arabic) using the Google API, and the other does signature validation. I am preparing the mp4 and planning to submit it by this Sunday.