Import type "array"? · Issue #496 · arangodb/arangojs · GitHub
Import type "array"? #496

Closed
Simran-B opened this issue Jan 15, 2018 · 8 comments
Labels
Bug A code defect that needs to be fixed. Discussion Warrants further discussion. Probably an incomplete feature request.

@Simran-B
Contributor

For imports, there is a check type !== "array":

isJsonStream: Boolean(!opts || opts.type !== "array"),

https://github.com/arangodb/arangojs/blob/master/src/collection.ts#L351

According to the HTTP API docs, only the following are supported: auto, documents, list.

Either the documentation is lacking, or it needs to be changed to list in arangojs.

@jsteemann
Contributor

"documents", "array", "list", "auto" are the valid values for "type" on the server side.
I think all of them will interpret the input data as line-wise JSON.
Any other value for "type", or the absence of the "type" parameter, will trigger the CSV import, which also expects line-wise JSON but interprets it differently: the first line is expected to be a JSON array with the column names, and the following lines are expected to be JSON arrays with the values.

@Simran-B
Contributor Author

The actual behavior of the server is somewhat confusing and/or the documentation is incorrect.

  • type (required): Determines how the body of the request will be interpreted. type can have the following values:
    • documents: when this type is used, each line in the request body is expected to be an individual JSON-encoded document. Multiple JSON objects in the request body need to be separated by newlines.
    • list: when this type is used, the request body must contain a single JSON-encoded array of individual objects to import.
    • auto: if set, this will automatically determine the body type (either documents or list).

Test file: list.json

["_key","value1","value2"]
["abc",25,"test"]
["foo","bar","baz"]

If I use type=auto, it appears to be unable to automatically determine that my input file uses the list format.

PS> curl -u root: -X POST --data-binary @list.json 'http://localhost:8529/_api/import?collection=a1&details=true&type=auto'
{"error":true,"errorMessage":"expecting a valid JSON array in the request. got: Expecting EOF","code":400,"errorNum":400}

An explicit type=list does not work either however:

PS> curl -u root: -X POST --data-binary @list.json 'http://localhost:8529/_api/import?collection=a1&details=true&type=list'
{"error":true,"errorMessage":"expecting a valid JSON array in the request. got: Expecting EOF","code":400,"errorNum":400}

Nor does the apparent alias type=array:

PS> curl -u root: -X POST --data-binary @list.json 'http://localhost:8529/_api/import?collection=a1&details=true&type=array'
{"error":true,"errorMessage":"expecting a valid JSON array in the request. got: Expecting EOF","code":400,"errorNum":400}

type=documents fails as expected:

PS> curl -u root: -X POST --data-binary @list.json 'http://localhost:8529/_api/import?collection=a1&details=true&type=documents'
{"error":false,"created":0,"errors":3,"empty":0,"updated":0,"ignored":0,"details":["at position 1: invalid JSON type (expecting object), offending document: [\"_key\",\"value1\",\"value2\"]","at position 2: invalid JSON type (expecting object), offending document: [\"abc\",25,\"test\"]","at position 3: invalid JSON type (expecting object), offending document: [\"foo\",\"bar\",\"baz\"]"]}

Strangely, the import succeeds without type parameter, or if the type parameter has some other / invalid value (with collection truncation between calls).

type=jsonl:

PS> curl -u root: -X POST --data-binary @list.json 'http://localhost:8529/_api/import?collection=a1&details=true&type=jsonl'
{"error":false,"created":2,"errors":0,"empty":0,"updated":0,"ignored":0,"details":[]}

type=json:

PS> curl -u root: -X POST --data-binary @list.json 'http://localhost:8529/_api/import?collection=a1&details=true&type=json'
{"error":false,"created":2,"errors":0,"empty":0,"updated":0,"ignored":0,"details":[]}

no type parameter:

PS> curl -u root: -X POST --data-binary @list.json 'http://localhost:8529/_api/import?collection=a1&details=true'
{"error":false,"created":2,"errors":0,"empty":0,"updated":0,"ignored":0,"details":[]}

type=invalid:

PS> curl -u root: -X POST --data-binary @list.json 'http://localhost:8529/_api/import?collection=a1&details=true&type=invalid'
{"error":false,"created":2,"errors":0,"empty":0,"updated":0,"ignored":0,"details":[]}
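When scripting checks like the ones above, it may help to look at more than the top-level error flag: the type=documents run returned "error":false even though all three lines were rejected. The following is a minimal sketch only. The interface mirrors the fields visible in the curl output, while importSucceeded is a hypothetical helper and not part of arangojs or the server API.

```typescript
// Shape of the response body, taken from the curl output above.
interface ImportResult {
  error: boolean;
  created: number;
  errors: number;
  empty: number;
  updated: number;
  ignored: number;
  details: string[];
}

// Hypothetical helper: "error" alone is not enough, because the
// type=documents run reported error:false together with errors:3.
function importSucceeded(result: ImportResult, expectedDocs: number): boolean {
  return !result.error && result.errors === 0 && result.created === expectedDocs;
}

// Response of the run without a type parameter, copied from above.
const noTypeRun = JSON.parse(
  '{"error":false,"created":2,"errors":0,"empty":0,"updated":0,"ignored":0,"details":[]}'
) as ImportResult;

// Counts from the type=documents run; the three detail strings are left out for brevity.
const documentsRun: ImportResult = {
  error: false, created: 0, errors: 3, empty: 0, updated: 0, ignored: 0, details: [],
};

const noTypeOk = importSucceeded(noTypeRun, 2); // true
const documentsOk = importSucceeded(documentsRun, 2); // false
```

The expected count is 2 rather than 3 because the first line of list.json holds the attribute names, not a document.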

@jsteemann
Contributor

#496 (comment) describes the server's current behavior.
Not saying that it makes so much sense. How can we go on from here?

@Simran-B
Contributor Author
Simran-B commented Jan 15, 2018

Is CSV import distinct from array-style data (list / array)? Or do you mean this by CSV?

["_key","value1","value2"]
["abc",25,"test"]
["foo","bar","baz"]

Document-style format is this:

{"_key":"abc","value1":25,"value2":"test"}
{"_key":"foo","value1":"bar","value2":"baz"}
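To make the relationship between the two samples concrete, here is a rough sketch that turns the array-style lines into the document-style lines. arrayLinesToDocumentLines is a hypothetical helper written for illustration; it assumes the first non-empty line holds the attribute names and is not taken from arangojs or arangod.

```typescript
// Hypothetical helper: turn line-wise JSON arrays (header row first)
// into the equivalent line-wise JSON objects.
function arrayLinesToDocumentLines(input: string): string {
  const rows = input
    .split("\n")
    .map((line) => line.trim())
    .filter((line) => line.length > 0)
    .map((line) => JSON.parse(line) as unknown[]);

  const [header, ...values] = rows;
  if (header === undefined) return "";

  return values
    .map((row) => {
      const doc: Record<string, unknown> = {};
      // Pair each attribute name with the value at the same position.
      header.forEach((key, i) => {
        doc[String(key)] = row[i];
      });
      return JSON.stringify(doc);
    })
    .join("\n");
}

const arrayStyle = [
  '["_key","value1","value2"]',
  '["abc",25,"test"]',
  '["foo","bar","baz"]',
].join("\n");

// Yields the two document-style lines shown above.
const documentStyle = arrayLinesToDocumentLines(arrayStyle);
```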

I don't really understand the purpose of type=auto. I would expect it to automatically determine whether the data is in array or document format (both line-based). It does not work for array-style data in my test against a 3.3.2 server running on Windows 10. Document-style data works without errors.

type=list fails with both formats in my test. It should work with array-style data though, shouldn't it? type=array seems to behave identically to type=list.

Without a type attribute, the array-style format is expected and it works for that format. It fails for document-style data, which is somewhat expected.

Would it be possible to fix auto and make it the default (= use auto if no type is specified), or would that break backward compatibility?

list / array needs fixing IMO. It should work for array-style data, unless I'm mixing up array-style and import of actual CSV data here.

The documentation needs fixing to properly explain when to set which parameter values (it is at least described in a confusing way) and should mention aliases if available.

@jsteemann
Contributor

"list" was our old name for "array" (remember we did some internal renaming from "list" to "array" a long time ago). "list" was simply kept to keep it downwards-compatible.
"csv" is actually the array-style input, which seems to have been named CSV because it is normally created by arangoimport when importing CSV files. But the name does not make much sense on the server side, because on the server side the input format is an array and not CSV. It's just legacy.

The server-side behavior is as follows:

  • "documents" expects one JSON object per input line
  • "array"/"list" expect a JSON array of JSON documents, with whitespace anywhere as the user sees fit
  • "auto" will handle either the "documents" or the "array"/"list" format. It will peek at the first non-whitespace character in the input, and go into "array"/"list" mode if the first character is a [, and into "documents" mode for any other character.
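
The "auto" rule above can be sketched in a few lines. This is an illustration of the described behavior, not the server's actual code; detectBodyMode and isSingleJsonArray are hypothetical names. It also suggests why the earlier list.json test failed with type=auto: the file starts with [, so it is treated as "array"/"list" input, yet the body as a whole is not a single JSON array.

```typescript
type BodyMode = "array" | "documents";

// Peek at the first non-whitespace character, as described above.
function detectBodyMode(body: string): BodyMode {
  const match = /\S/.exec(body);
  const first = match ? match[0] : "";
  return first === "[" ? "array" : "documents";
}

// True only if the whole body parses as one JSON array.
function isSingleJsonArray(body: string): boolean {
  try {
    return Array.isArray(JSON.parse(body));
  } catch {
    return false;
  }
}

// The list.json body from the earlier test.
const listJson = '["_key","value1","value2"]\n["abc",25,"test"]\n["foo","bar","baz"]';

const mode = detectBodyMode(listJson); // "array"
const parsesAsArray = isSingleJsonArray(listJson); // false: content follows the first array
```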

@Simran-B
Contributor Author
Simran-B commented Jan 16, 2018

I see. So there's a total of 4 formats:

  • CSV: one record per line, with fields separated by e.g. a comma and optionally enclosed in quote marks. First row contains the column labels / attribute keys. Can be imported into ArangoDB with arangoimp, but is not directly understood by arangod.

    FirstName,LastName,Age
    John,Smith,35
    Katie,Foster,28
    
  • CSV-style arrays: each line is a JSON array. The first line is an array of strings with the attribute keys. The subsequent lines are arrays with the attribute values and can be of arbitrary type. The input in its entirety is not valid JSON. It is valid JSONL however. Supported by arangod by omitting the type parameter. Arangoimp reads CSV and sends this format (basically JSONL arrays) to the server. Arangoimport does not support this as input format however.

    ["FirstName","LastName","Age"]
    ["John","Smith",35]
    ["Katie","Foster",28]
  • JSONL: one JSON object per line. Each line will become a document. Each line is valid JSON, but the entirety of the input is not (unless there is only one line). Understood by arangod, type documents or auto.

    {"FirstName":"John","LastName":"Smith","Age":35}
    {"FirstName":"Katie","LastName":"Foster","Age":28}
  • JSON: an array of JSON objects. Each element will become a document. The input in its entirety is valid JSON. Understood by arangod, type array / list or auto.

    [
      {"FirstName":"John","LastName":"Smith","Age":35},
      {"FirstName":"Katie","LastName":"Foster","Age":28}
    ]

I wasn't aware that JSONL also permits arrays at the top level - but since it is line-based JSON and top-level arrays are perfectly fine in JSON, it makes sense.

It is not clear from the docs however, what the supported formats are and how to use the type parameter correctly. I suggest we add format examples and a description of them at the top.
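
As an illustration of the three body formats that arangod accepts according to the list above, the following sketch serializes the same two records each way. The function names are hypothetical and chosen only for this example; nothing here is arangojs API.

```typescript
type Doc = Record<string, unknown>;

const people: Doc[] = [
  { FirstName: "John", LastName: "Smith", Age: 35 },
  { FirstName: "Katie", LastName: "Foster", Age: 28 },
];

// CSV-style arrays: a header line, then one array of values per line.
// Sent without a type parameter.
function toCsvStyleArrays(docs: Doc[], keys: string[]): string {
  const lines = [keys, ...docs.map((doc) => keys.map((key) => doc[key] ?? null))];
  return lines.map((line) => JSON.stringify(line)).join("\n");
}

// JSONL: one JSON object per line. Sent with type documents or auto.
function toJsonl(docs: Doc[]): string {
  return docs.map((doc) => JSON.stringify(doc)).join("\n");
}

// JSON: a single array of objects. Sent with type array / list or auto.
function toJsonArray(docs: Doc[]): string {
  return JSON.stringify(docs);
}

const csvStyleBody = toCsvStyleArrays(people, ["FirstName", "LastName", "Age"]);
const jsonlBody = toJsonl(people);
const jsonBody = toJsonArray(people);
```

Missing attributes are written as null in the CSV-style output here; that is a choice made for this sketch, not documented server behavior.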

@pluma
Contributor
pluma commented Jan 18, 2018

I'm pretty sure the current logic in arangojs is wrong. If this can be clarified, I'd love to fix it. Maybe we can also clarify it right in the docs so other API users and driver maintainers will know what to do.

@Simran-B
Contributor Author

@pluma arangojs seems to default to auto, whereas the API has no default and waits for CSV-style arrays in that case. To import CSV-style arrays with arangojs (like shown in the second example of the import method), one has to pass type: "" or type: undefined to overwrite the default auto.

I don't particularly like the fact that there is no actual type for CSV-style arrays. It would be much easier to understand if we had a distinct option to use it. Not sure what to call it though, "csv-arrays"?
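
One way a driver could express "no type means CSV-style arrays" is to leave the parameter out of the query string entirely. The sketch below assumes exactly that; importPath is a hypothetical helper and does not show how arangojs builds its requests.

```typescript
// The four values the server recognizes, per the discussion above.
type ImportType = "documents" | "array" | "list" | "auto";

// Hypothetical helper: builds the import path, leaving out "type"
// when none is given so the server falls back to CSV-style arrays.
function importPath(collection: string, type?: ImportType): string {
  const params = [`collection=${encodeURIComponent(collection)}`];
  if (type !== undefined) {
    params.push(`type=${type}`);
  }
  return `/_api/import?${params.join("&")}`;
}

const csvStylePath = importPath("a1"); // "/_api/import?collection=a1"
const autoPath = importPath("a1", "auto"); // "/_api/import?collection=a1&type=auto"
```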

@pluma added the Discussion label Mar 6, 2018
@pluma added the Bug label Mar 6, 2018
@pluma closed this as completed in 024edac Aug 16, 2018
pluma added a commit that referenced this issue Aug 27, 2018
Fixes #461. Fixes #496.