So why did I choose to implement native CSV parsing? CSV is one of the most popular file formats for machine learning datasets and is often the default format in competitions such as Kaggle's. CSV files have the same schema for every example, while the VW text format doesn't. Although converters in Python and Perl have been written to turn such files into VW text format, it would be convenient if VW could understand CSV files natively. I also wanted to challenge myself: there is a surprising amount of complexity in designing a generalized CSV parser, so this project was as much about considering all of the design pieces as about implementing a working parser. The time I devoted has also paid off, since my project has already been merged into VW upstream!
CSV files are often separated by commas (`,`) or tabs. However, alternative delimiter-separated files are often given a `.csv` extension despite the use of a non-comma field separator. This loose terminology can cause problems in data exchange. Many applications that accept CSV files have options to select the delimiter character and the quotation character. Semicolons (`;`) are often used instead of commas in many European locales in order to use the comma (`,`) as the decimal separator and, possibly, the period (`.`) as a decimal grouping character.
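As a quick illustration of how the delimiter choice interacts with the decimal separator, here is a small example using Python's built-in `csv` module (shown only as a reference implementation; it is unrelated to VW's parser):

```python
import csv
import io

# The same logical record, encoded with two common delimiters.
comma_data = "price,quantity\n3.14,2\n"
semicolon_data = "price;quantity\n3,14;2\n"  # European style: comma as decimal separator

comma_rows = list(csv.reader(io.StringIO(comma_data)))
semicolon_rows = list(csv.reader(io.StringIO(semicolon_data), delimiter=";"))

print(comma_rows)      # [['price', 'quantity'], ['3.14', '2']]
print(semicolon_rows)  # [['price', 'quantity'], ['3,14', '2']]
```

If the reader is not told which delimiter is in use, the second file splits into the wrong number of fields, which is exactly the data-exchange problem described above.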
Separating fields with the field separator is CSV's foundation, but commas in the data have to be handled specially.
The short answer is that we follow the RFC 4180 and MIME standards.
The 2005 technical standard RFC 4180 formalizes the CSV file format and defines the MIME type "text/csv" for the handling of text-based fields. However, the interpretation of the text of each field is still application-specific. Files that follow the RFC 4180 standard can simplify CSV exchange and should be widely portable. Here are its requirements:
- Lines end with CR or CRLF characters, and the line break is optional for the last line.
- CSV files can have an optional header record. There is no sure way to detect whether it is present, so care is required when importing.
- Each record should contain the same number of separated fields.
- Any field may be quoted with double quotes.
- Fields containing line breaks, double quotes, or commas should be quoted.
- If double-quotes are used to enclose fields, then a double-quote in a field must be represented by two double-quote characters.
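The quoting rules in particular are easy to get wrong, so here is a concrete check using Python's `csv` module, which follows the same RFC 4180 conventions (this is just an illustration, not VW code):

```python
import csv
import io

# A field containing both the separator and a double quote must be quoted,
# and the embedded double quote is doubled, per RFC 4180.
line = '1,"He said ""hi"", then left",3\r\n'
row = next(csv.reader(io.StringIO(line)))
print(row)  # ['1', 'He said "hi", then left', '3']
```

The comma inside the quoted field does not split it, and the doubled quote collapses back to a single literal quote.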
- Allows specifying the CSV field separator with `--csv_separator`; the default is `,`. The characters `"`, `|`, and `:` are reserved and not allowed as separators, since the double quote (`"`) is used for escaping, the vertical bar (`|`) separates namespace and feature names, and the colon (`:`) can be used in labels.
- For each separated field, the outer double quotes of a cell are automatically removed when they form a pair.
- `--csv_separator` symbols that appear inside double-quoted cells are not treated as separators but as ordinary characters.
- Double quotes that appear at the start and end of a cell are considered to enclose the field. Quotes that appear elsewhere, outside the enclosing pair, have no special meaning. (This is also how Microsoft Excel parses.)
- If double quotes are used to enclose a field, a double quote appearing inside the field must be escaped by preceding it with another double quote; the escape character is removed during parsing.
- Use the header line for feature names (and possibly namespaces). The label and tag columns are specified by the reserved names `_label` and `_tag` by default. Each other separated field in the header may contain a namespace and feature name separated by the namespace separator, vertical bar (`|`). Use `--csv_header` to override the CSV header by providing your own. Combined with `--csv_no_file_header`, the parser assumes there is no header line in the CSV file, and in that case specifying the header with `--csv_header` is a must.
- If the number of separated fields in the current line is greater than in the header, an error is thrown.
- Trims each field of ASCII whitespace (`\r\n\f\v`) as well as the UTF-8 BOM bytes (`\xef\xbb\xbf`) before separation.
- If no namespace is separated out, the empty namespace is used.
- The separator option supports `\t` to represent a tab. Otherwise, assigning more than one character throws an error.
- The label is read directly as a string and interpreted using the VW text label parser.
- Tries to judge whether each feature value is a float or a string; if the parse result is NaN, it is treated as a string. Quoted numbers are always treated as strings.
- If a feature value is empty, that feature is skipped.
- The parser is reset when the EOF of a file is reached (to allow possible multiple-input-file support).
- Supports `--csv_ns_value` to scale namespace values by specifying float ratios, e.g. `--csv_ns_value=a:0.5,b:0.3,:8`, which gives namespace `a` a ratio of 0.5, `b` a ratio of 0.3, the empty namespace a ratio of 8, and all other namespaces 1.
- If all the cells in a line are empty, the line is considered an empty line. CSV is not a good fit for the multiline format, as the many empty fields show; the multiline format often means that different lines have different schemas. However, I still left in empty-line support to keep the parser flexible and extendable, so users can still express multiline examples in CSV files, although this is not listed as supported. We still throw an error if the number of fields in a line doesn't match the previous line, even if all the fields are empty, as this usually indicates a typo the user did not intend.
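Several of the rules above (quote pairing, separators inside quoted cells, float-versus-string detection, and `--csv_ns_value` ratios) can be illustrated with a short Python sketch. This is my own toy rendering of the stated behavior, not VW's actual C++ implementation, and all helper names here are made up:

```python
def split_csv_line(line, sep=","):
    """Split one line on `sep`, honoring double-quoted fields.

    A quote is only special at the start of a field: the quoted region
    runs to the matching closing quote, `""` inside it unescapes to a
    single quote, and separators inside it are ordinary characters.
    Quotes appearing elsewhere are kept as literal characters.
    Returns (text, was_quoted) pairs.
    """
    fields = []
    i, n = 0, len(line)
    while True:
        buf = []
        quoted = i < n and line[i] == '"'
        if quoted:
            i += 1
            while i < n:
                if line[i] == '"':
                    if i + 1 < n and line[i + 1] == '"':
                        buf.append('"')  # escaped double quote
                        i += 2
                        continue
                    i += 1               # closing quote is dropped
                    break
                buf.append(line[i])
                i += 1
        while i < n and line[i] != sep:  # rest of the field is literal
            buf.append(line[i])
            i += 1
        fields.append(("".join(buf), quoted))
        if i >= n:
            return fields
        i += 1                           # skip the separator


def classify_feature(raw, was_quoted):
    """Quoted cells are always strings; otherwise try a float parse and
    fall back to string when parsing fails or yields NaN."""
    if was_quoted:
        return raw
    try:
        value = float(raw)
    except ValueError:
        return raw
    return raw if value != value else value  # NaN compares unequal to itself


def parse_ns_value(option):
    """Parse 'a:0.5,b:0.3,:8' into a {namespace: ratio} dict."""
    return {ns: float(ratio) for ns, _, ratio in
            (part.rpartition(":") for part in option.split(","))}


cells = split_csv_line('"b,c",3.14,nan,"7"')
print([classify_feature(text, q) for text, q in cells])
# ['b,c', 3.14, 'nan', '7']
print(parse_ns_value("a:0.5,b:0.3,:8"))
# {'a': 0.5, 'b': 0.3, '': 8.0}
```

This matches the listed behavior: the comma inside the quoted cell is not a separator, unquoted `3.14` becomes a number while quoted `"7"` and the NaN-valued `nan` stay strings, and the empty namespace gets a ratio of 8 while unlisted namespaces would default to 1.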
The project reaches 100% test and code coverage. On my computer, it takes only 829 ms to handle 200,000 examples, which is comparable to the 623 ms it takes to parse the equivalent data file in VW's custom format, and is equivalent to 21.7 MB/s of throughput.
I expected the CSV format to be a little slower than the VW custom text format. The VW text parser can parse as it reads, since its elements generally have fixed positions, while the CSV parser needs to store the split elements in an array and look that array up according to the header for labels, tags, namespaces, and feature values; on top of that, there is double-quote trimming and escape support, which will definitely cost more time.
Still, I have tried my best to optimize it, and the performance is to my satisfaction. You can use the throughput tool to check the unoptimized version from earlier in the project: the final version is actually a 10x improvement!
- VW text parser: `2239418 bytes parsed in 60581μs 36.9657MB/s`
- Unoptimized CSV parser: `1799450 bytes parsed in 912927μs 1.97108MB/s`
- Optimized CSV parser: `1799450 bytes parsed in 87728μs 20.5117MB/s`
My optimization mainly involved turning all the functions used by the parser into force-inlined ones. During parsing, instead of parsing column by column, I categorize all the features by namespace and parse them namespace by namespace, to avoid computing each namespace hash twice. I also replaced the map with an unordered map, used VW's own `v_array` data structure in place of `std::vector`, and used VW's own implementation of `parsefloat` in place of `std::stof`, which is much faster. The statistics show that these changes really improved the performance a lot.
That's all. Thanks for reading, and special thanks to my mentors Jack Gerrits and Peter Chang from Microsoft Research - New York City Lab, who really helped me a lot during the project.