Processing Fixed Width and Complex Files

Timothy Spann (tspannhw) ・Originally published at datainmotion.dev ・4 min read

Pointers

The first decision you will have to make is whether the file is structured at all. If it is a known type like CSV, JSON, Avro, XML or Parquet, then just use NiFi's record readers and record-oriented processors.

If it's semi-structured like a log file, GrokReader or ExtractGrok may work.
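
To make the Grok idea concrete outside of NiFi, here is a minimal Python sketch. A Grok expression is essentially a regular expression with named captures, so the sketch uses a plain regex with named groups. The log layout (timestamp, level, message), the sample line, and the function name `parse_log_line` are assumptions made for illustration; they are not taken from any particular NiFi flow.

```python
import re

# Hypothetical log layout: "<timestamp> <LEVEL> <message>".
# A Grok expression such as
#   %{TIMESTAMP_ISO8601:timestamp} %{LOGLEVEL:level} %{GREEDYDATA:message}
# expands to a regex with named groups, roughly like this simplified one.
LOG_PATTERN = re.compile(
    r"^(?P<timestamp>\d{4}-\d{2}-\d{2}[ T]\d{2}:\d{2}:\d{2})\s+"
    r"(?P<level>[A-Z]+)\s+"
    r"(?P<message>.*)$"
)


def parse_log_line(line):
    """Return a dict of named fields, or None if the line does not match."""
    match = LOG_PATTERN.match(line.rstrip("\n"))
    return match.groupdict() if match else None


# Made-up sample line, for illustration only.
sample = "2020-05-01 12:30:45 ERROR Disk quota exceeded on /data"
record = parse_log_line(sample)
print(record)
```

Lines that do not match return `None`, which is similar in spirit to routing unmatched content to a separate relationship for inspection.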

If it's close to CSV, you may be able to tweak the CSVReader settings to make it work (say, header or no header) or try the other of the two CSV parsers NiFi offers (Jackson or Apache Commons).
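
The header versus no-header choice can be sketched with Python's standard `csv` module. This is only an illustration of the idea, not NiFi's CSVReader itself; the helper `read_csv`, the column names, and the sample data are hypothetical.

```python
import csv
import io


def read_csv(text, has_header=True, fieldnames=None):
    """Parse CSV text into a list of dicts.

    With has_header=True the first line supplies the column names.
    Otherwise the caller must pass the column names explicitly.
    """
    if has_header:
        reader = csv.DictReader(io.StringIO(text))
    else:
        if fieldnames is None:
            raise ValueError("fieldnames is required when there is no header")
        reader = csv.DictReader(io.StringIO(text), fieldnames=fieldnames)
    return [dict(row) for row in reader]


# Made-up sample data: the same rows with and without a header line.
with_header = "id,name\n1,alpha\n2,beta\n"
without_header = "1,alpha\n2,beta\n"

rows_a = read_csv(with_header)
rows_b = read_csv(without_header, has_header=False, fieldnames=["id", "name"])
print(rows_a == rows_b)
```

Both calls produce the same records, which is the point of the tweak: the data is identical and only the source of the column names changes.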

If it's a document format like PDF, Word, Excel or RTF, I have a custom processor that uses Apache Tika and should be able to parse it into text. Once it is text, you can probably work with it.
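
Once the content is plain text, a fixed-width layout can be handled by slicing each line at known column offsets. The sketch below shows that idea in plain Python, the kind of logic that could live in a script or a custom reader. The layout (`account_id`, `name`, `balance`), the offsets, and the sample rows are all hypothetical.

```python
# Hypothetical layout: (field name, start offset, end offset), end exclusive.
LAYOUT = [
    ("account_id", 0, 6),
    ("name", 6, 16),
    ("balance", 16, 24),
]


def parse_fixed_width(line, layout=LAYOUT):
    """Slice one fixed-width line into a dict, trimming the padding."""
    return {name: line[start:end].strip() for name, start, end in layout}


# Build made-up sample lines with explicit padding so the widths are exact.
lines = [
    "000123" + "Jane Doe".ljust(10) + "00150.75",
    "000456" + "Bob".ljust(10) + "00003.10",
]

records = [parse_fixed_width(line) for line in lines]
for rec in records:
    print(rec)
```

Keeping the layout in a single table makes it easy to adjust when the file specification changes, and the resulting dicts map naturally onto records for downstream processing.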

Examples

Documentation

Processors To Use For File Manipulation

  • AttributesToCSV
  • AttributesToJSON
  • ConvertExcelToCSVProcessor
  • ConvertRecord
  • ConvertText
  • CSVReader
  • EvaluateJSONPath
  • EvaluateXPath
  • EvaluateXQuery
  • ExecuteScript
  • ExecuteStreamCommand
  • ExtractGrok
  • ExtractText
  • FlattenJson
  • ForkRecord
  • GrokReader
  • JsonPathReader
  • JsonTreeReader
  • JoltTransformJSON
  • JoltTransformRecord
  • LookupAttribute
  • LookupRecord
  • MergeContent
  • MergeRecord
  • ModifyBytes
  • ParseSyslog*
  • PartitionRecord
  • QueryRecord
  • ReaderLookup
  • ReplaceText
  • ReplaceTextWithMapping
  • ScriptedReader
  • ScriptedRecordSink
  • ScriptedTransformRecord
  • SegmentContent
  • SplitContent
  • SplitJson
  • SplitRecord
  • SplitText
  • SplitXml
  • SyslogReader
  • TransformXml
  • UnpackContent
  • UpdateAttribute
  • UpdateRecord
  • ValidateCsv
  • ValidateRecord
  • ValidateXml

Custom Processors

Helper Projects, SDK, Libraries and Services
