NGSIToHDFS

Functionality

The NGSIToHDFS processor, or simply NGSIToHDFS, is a sink designed to persist NGSI-like context data events within an HDFS deployment. Usually, such context data is notified by an Orion Context Broker instance, but it could come from any other system speaking the NGSI language.

Independently of the data generator, NGSI context data is always transformed into internal NGSIEvent objects at Draco sources. In the end, the information within these events must be mapped into specific HDFS data structures at the Draco sinks.

The following sections explain this in detail.

Name	Default Value	Allowable Values	Description
HDFS Host	no	localhost	FQDN/IP address where HDFS Namenode runs, or comma-separated list of FQDN/IP addresses where HDFS HA Namenodes run.
HDFS Port	no	14000	14000 if using HttpFS (rest), 50070 if using WebHDFS (rest), 8020 if using the Hadoop API (binary).
HDFS username	yes	N/A	If `service_as_namespace=false` then it must be an already existing user in HDFS. If `service_as_namespace=true` then it must be an HDFS superuser.
HDFS password	no	N/A	Password for the above `hdfs_username`; this is only required for Hive authentication.
NGSI version	no	v2	List of supported NGSI versions (v2 and ld); currently only v2 is supported.
Data Model	db-by-entity	db-by-entity	The data model used to create the columns when an event is received. You can choose between db-by-service-path and db-by-entity; the default value is db-by-service-path.
Attribute persistence	row	row, column	The mode used to store the data inside the column. Allowable values are row and column.
Default Service	test		If the Fiware-Service header is not set in the context broker, this value is used as the Fiware-Service.
Default Service path	/path		If the Fiware-ServicePath header is not set in the context broker, this value is used as the Fiware-ServicePath.
enable_encoding	no	false	true or false, true applies the new encoding, false applies the old encoding.
enable_lowercase	no	false	true or false.
data_model	no	dm-by-entity	Always dm-by-entity, even if not configured.
file_format	no	json-row	json-row, json-column, csv-row or csv-column.
backend.impl	no	rest	rest, if a WebHDFS/HttpFS-based implementation is used when interacting with HDFS; or binary, if a Hadoop API-based implementation is used when interacting with HDFS.
backend.max_conns	no	500	Maximum number of connections allowed for an HTTP-based HDFS backend. Ignored if using a binary backend implementation.
backend.max_conns_per_route	no	100	Maximum number of connections per route allowed for an HTTP-based HDFS backend. Ignored if using a binary backend implementation.
oauth2_token	no	N/A	OAuth2 token required for the HDFS authentication.
service_as_namespace	no	false	If configured as true then the `fiware-service` (or the default one) is used as the HDFS namespace instead of `hdfs_username`, which in this case must be an HDFS superuser.
csv_separator	no	,
batch_size	no	1	Number of events accumulated before persistence.
batch_timeout	no	30	Number of seconds the batch will be building before it is persisted as it is.
batch_ttl	no	10	Number of retries when a batch cannot be persisted. Use `0` for no retries, `-1` for infinite retries. Note that an infinite TTL (or even a very large one) may consume all of the sink's channel capacity very quickly.
batch_retry_intervals	no	5000	Comma-separated list of intervals (in milliseconds) at which retries of non-persisted batches are performed. The first retry happens after the number of milliseconds given by the first value, the second retry after the number given by the second value, and so on. If `batch_ttl` is greater than the number of intervals, the last interval is repeated.
hive	no	true	true or false.
hive.server_version	no	2	`1` if the remote Hive server runs HiveServer1 or `2` if the remote Hive server runs HiveServer2.
hive.host	no	localhost
hive.port	no	10000
hive.db_type	no	default-db	default-db or namespace-db. If `hive.db_type=default-db` then the default Hive database is used. If `hive.db_type=namespace-db` and `service_as_namespace=false` then the `hdfs_username` is used as Hive database. If `hive.db_type=namespace-db` and `service_as_namespace=true` then the notified fiware-service is used as Hive database.
krb5_auth	no	false	true or false.
krb5_user	no	empty	Ignored if `krb5_auth=false`, mandatory otherwise.
krb5_password	no	empty	Ignored if `krb5_auth=false`, mandatory otherwise.
krb5_login_conf_file	no	krb5_login.conf	Ignored if `krb5_auth=false`.
krb5_conf_file	no	krb5.conf	Ignored if `krb5_auth=false`.
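
The interaction between `batch_ttl` and `batch_retry_intervals` described in the table can be illustrated with a minimal Python sketch. This is not Draco's implementation; the function names `parse_intervals` and `retry_delays` are hypothetical and only model the documented rules: `0` means no retries, `-1` means infinite retries, and the last interval is repeated once the list is exhausted.

```python
from itertools import count, islice
from typing import Iterator, List


def parse_intervals(raw: str) -> List[int]:
    """Parse a comma-separated list of millisecond intervals, e.g. "5000,10000"."""
    return [int(part.strip()) for part in raw.split(",") if part.strip()]


def retry_delays(batch_ttl: int, intervals: List[int]) -> Iterator[int]:
    """Yield the delay in milliseconds to wait before each retry.

    batch_ttl = 0 yields nothing (no retries); batch_ttl = -1 yields forever.
    """
    if not intervals:
        raise ValueError("at least one retry interval is required")
    # Hypothetical modelling choice: -1 maps to an endless counter.
    attempts = count() if batch_ttl == -1 else range(batch_ttl)
    for attempt in attempts:
        # Once the list is exhausted, keep repeating the last interval.
        yield intervals[min(attempt, len(intervals) - 1)]


if __name__ == "__main__":
    # Four retries with two configured intervals: the last one is repeated.
    print(list(retry_delays(4, parse_intervals("5000,10000"))))
    # Infinite TTL: take only the first five delays to keep the demo finite.
    print(list(islice(retry_delays(-1, [1000, 2000]), 5)))
```

With `batch_ttl=4` and `batch_retry_intervals=5000,10000`, the sketch yields the delays 5000, 10000, 10000 and 10000. The infinite case shows why a TTL of `-1` should be used with care: the batch is never dropped, so it keeps occupying channel capacity.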

NGSIToHDFS

Functionality

Mapping NGSI events to NGSIEvent objects

Mapping NGSIEvents to HDFS data structures

HDFS paths naming conventions

Json row-like storing

Json column-like storing

CSV row-like storing

CSV column-like storing

Hive

Example

NGSIEvent

Path names

Json row-like storing

Json column-like storing

CSV row-like storing

CSV column-like storing

Hive storing

Administration guide

Configuration

Use cases

Important notes

About the persistence mode

About the binary backend

About batching

About the encoding

Mapping NGSI events to `NGSIEvent` objects

Mapping `NGSIEvent`s to HDFS data structures

`NGSIEvent`