|Posted on Friday, Apr 9, 2021 - 16:17: |
A preview version of X-Ways Forensics 20.3 is now available. The URL of the download directory for all recent versions can be retrieved by querying one's license status as always.
What's new in v20.3 Preview 1?
* The OCR capabilities of the software package Tesseract can now be utilized from within X-Ways Forensics and X-Ways Investigator. The package can be downloaded from our web server. Updated download instructions are available from the same place as always. If Tesseract is found by v20.3 in the subdirectory \Tesseract of the installation directory when v20.3 is first run, Tesseract will be activated automatically. Otherwise please go to Options | Viewer Programs to indicate the path.
* OCR can be applied as part of logical searches or indexing to suitable files such as document scans or digitally stored faxes in TIFF format or PDF documents that contain only graphic content. The default file mask even includes *.jpg; whether applying OCR to every JPEG file in a case is necessary or a little excessive is up to you to decide, and you have full control over the scope of the search using various means anyway. Please be aware that checking high-resolution photos for text takes a lot of time. Digital photos in JPEG and HEIC format will be rotated according to the instructions in the Exif metadata to restore the correct orientation and thus hopefully allow OCR of text that was originally photographed roughly horizontally. If the ordinary text decoding is already successful for a given file of a type that is contained in both file masks (*.pdf), OCR will not be applied additionally. The option "Store decoded text for context preview and future searches" will also keep text derived from OCR stored in the volume snapshot.
* Search hits returned by the logical search in OCR-derived text are identified as such in the Descr. column and highlighted in a different color. The Descr. filter allows you to list either only OCR search hits or only non-OCR search hits. Older versions of X-Ways Forensics can see OCR search hits from v20.3 when opening the same case, but won't know that they are OCR search hits.
* You can select up to two languages for text recognition at the same time, after clicking the ... button for this in Options | Viewer Programs. However, there is a trade-off if you select Chinese/Japanese and a Western language at the same time. This will deteriorate the recognition of the Asian characters. You may want to select *only* Chinese/Japanese for much better recognition in that language. English (actually Latin) letters can still be recognized in that case, even if English is not expressly selected, at reduced quality. Select both Chinese/Japanese and a Western language at the same time only if correct recognition is more important to you in the Western language.
* Preview mode now has a separate submode in addition to Raw submode, called Text mode, in which pure text from non-picture files is extracted, just like for the logical search with the decode option. That submode can also be useful to better understand how text is extracted from various document types, in particular from spreadsheets, for which different extraction options exist that may differ in output, especially in formatting.
* If the ordinary text extraction/decoding in Text submode does not return any result or if the previewed file is a picture file, and if Tesseract is available and active, OCR will be applied. This allows you to better understand how well OCR will work in searches for the kind of files that you are dealing with. You can also experiment with different languages selected and compare the quality of the results. The submode button is named "Text" by default, but will change its label to "OCR" to make you aware that OCR is or was employed to retrieve the text. OCR can be time-consuming for multi-page TIFF and PDF files, but can be interrupted by the user if necessary. If a logical search or indexing has applied OCR to a file before and the result was stored in the volume snapshot, then the OCR-based preview will be available instantly and OCR will not be re-applied from scratch.
* Both submodes Raw and Text in Preview mode remain active until you leave Preview mode or select a file of a different type. If you prefer to make either of these submodes more persistent, so that it remains active even when previewing files of different types, you can hold the Shift key while clicking the respective submode button.
* The Tesseract package that is downloadable from our web server already has support for the following languages integrated, in alphabetic order:
chi_sim: simplified Chinese (horizontal writing only)
chi_tra: traditional Chinese (horizontal writing only)
jpn: Japanese (horizontal writing only)
kor: Korean (horizontal writing only)
Other languages can be added if you can find .traineddata files for them at https://github.com/tesseract-ocr/tessdata_fast. Such files simply need to be put into the \tessdata subdirectory of Tesseract. Or you can visit https://github.com/tesseract-ocr/tessdata_best to download higher quality OCR engines for any of the supported languages. (Please note that OCR takes considerably more time with them.)
* Supported file types are generally the following: PDF, PostScript (PS), TIFF, JPEG, HEIC, PNG, GIF, BMP, WEBP, AutoCAD DXF, Photoshop PSP, and maybe more.
* Ability to use the Descr. filter to focus on search hits in misaligned UTF-16 text.
* Ability to highlight search hits in alternative e-mail previews.
* The cyclic tab key order defined for the main window now also applies in search hit list mode.
* Some minor improvements.
* Same fix level as v20.2 SR-1.
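
For readers who want to try the same language files outside X-Ways Forensics, here is a minimal Python sketch. It assumes a standalone `tesseract` executable and a `tessdata` directory of your choosing. The helper names `installed_languages` and `build_tesseract_command` are hypothetical, and the sketch says nothing about how X-Ways Forensics calls Tesseract internally. It lists the languages for which a .traineddata file is present and builds a command line that combines up to two of them with Tesseract's `-l LANG+LANG` syntax.

```python
from pathlib import Path


def installed_languages(tessdata_dir):
    """Return the codes for which a .traineddata file exists in tessdata_dir.

    Every *.traineddata file is reported, including helper models such as
    'osd' if one happens to be present.
    """
    return sorted(p.stem for p in Path(tessdata_dir).glob("*.traineddata"))


def build_tesseract_command(image_path, languages, tessdata_dir=None):
    """Build an argument list for the standalone tesseract executable.

    The limit of two languages mirrors the limit described in the post above
    (an assumption of this sketch, not a restriction of the Tesseract CLI).
    """
    if not 1 <= len(languages) <= 2:
        raise ValueError("select one or two languages")
    # 'stdout' as the output base makes tesseract print the recognized text.
    cmd = ["tesseract", str(image_path), "stdout", "-l", "+".join(languages)]
    if tessdata_dir is not None:
        cmd += ["--tessdata-dir", str(tessdata_dir)]
    return cmd


# Example: Chinese plus English, the combination with the trade-off noted above.
print(build_tesseract_command("scan.tif", ["chi_sim", "eng"]))
# ['tesseract', 'scan.tif', 'stdout', '-l', 'chi_sim+eng']
```

The resulting list can be passed to `subprocess.run` to compare recognition quality for different language selections on a sample file.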
|Posted on Friday, Apr 9, 2021 - 20:15: |
Thank you....this is a great update
|Posted on Tuesday, Apr 13, 2021 - 20:14: |
* Slightly more complete OCR output for certain PDF documents.
* Some minor improvements.
|Posted on Thursday, Apr 15, 2021 - 11:30: |
Very nice Stefan. I will try and take a look as soon as I can and explore this great addition to XWF. I know OCR has been hoped for by many of us for some time.
|Posted on Friday, Apr 16, 2021 - 11:51: |
* New X-Tension API functions XWF_PrepareTextAccess() and XWF_GetText().
* Some fixes and minor improvements.
|Posted on Monday, Apr 19, 2021 - 7:19: |
* One fix and one minor improvement.
|Posted on Tuesday, May 4, 2021 - 19:43: |
* Compressed data chunks in NTFS-deduplicated files are now decompressed, i.e. such files can now be opened. Requires access to Windows 8 or later.
* New option to name carved files after the number of their respective first sector, either with or without leading zeroes.
* Ability to apply the Flex Filters to the additional columns of event lists, such as event timestamp and event description.
* In Ext file systems, a new volume snapshot option allows running a more in-depth parsing of deleted directory entries during the initial creation of the volume snapshot, even if they are misaligned in relation to the current directory entries. This might find additional previously existing files in Ext, at a likely manageable risk of finding some garbage entries as well. The checkbox for this is labeled "Ext: Try misaligned deleted dir entries".
* The file "GREP Expressions.txt", in which X-Ways Forensics recalls friendly names of your favorite regular expressions, is now named "Regular Expressions.txt". Please rename your existing file, if you have one.
* Several of the changes and fixes of v20.2 SR-3.
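
To illustrate why the carved-file naming option offers leading zeroes, here is a small Python sketch. The function name `carved_file_name` and the padding width of 10 digits are assumptions made for illustration, not the values used by X-Ways Forensics. It shows that zero-padded names sort in sector order when sorted as plain text, whereas unpadded names do not.

```python
def carved_file_name(first_sector, extension, leading_zeroes=False, width=10):
    """Name a carved file after the number of its first sector.

    `width` is an assumed padding width, chosen only for this example.
    """
    if first_sector < 0:
        raise ValueError("sector numbers are non-negative")
    number = f"{first_sector:0{width}d}" if leading_zeroes else str(first_sector)
    return f"{number}.{extension}"


sectors = [2048, 96, 1500000]

# Plain text sorting of unpadded names does not follow sector order.
print(sorted(carved_file_name(s, "jpg") for s in sectors))
# ['1500000.jpg', '2048.jpg', '96.jpg']

# With leading zeroes, text order and sector order agree.
print(sorted(carved_file_name(s, "jpg", leading_zeroes=True) for s in sectors))
# ['0000000096.jpg', '0000002048.jpg', '0001500000.jpg']
```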