Requests
Fetching remote data, then parsing or querying it
The web is all about fetching data from remote sources. Fortunately, SEL provides an event type to make these requests that support common data formats.
Request
A Request
fetches data from a remote location (URL). If we want, for example, to pull in the contents of a published Google Sheet, we can make a GET
or POST
request like so:
=Request('get', 'https://docs.google.com/spreadsheets/d/e/2PACX-1vSRSc0h4ZvFGaY8ZmCSRHjc5nGE80tNKKvPwyAgivd35eprIS/pub?gid=944422473&single=true&output=csv')
This event will fetch the contents of this URL and store them in the event. The value of the event in the results table and log will be set to the status code of the response. All responses less than 400
are considered successful.
We can also pass common arguments params
, cookies
, json
, and headers
to Request
. Let's look at an example in context:
#!/usesummit/sel/0.1a
# Ping wayback machine to get page closest to now, scrape it, and clean it.
"url_now": =Object({"params": {"url": "glideapps.com"<customer_url>}})
"wayback_machine_today": =Request('get', "http://archive.org/wayback/available")
In this example, we create an Object to store the parameters that need to be passed to the Request
.
API key security
Many endpoints and API's require the use of keys, usernames, passwords, and/or tokens, in order to authenticate. To use these in SEL while keeping them secure, you should place them in your account Vault. This allows you to refer to them in your SEL code using liquid syntax, e.g.
{{ MY_VENDOR_API_KEY }}
instead of inserting them as text. This keeps these keys encrypted, out of code, and out of any database.
Since a lot of endpoints (like the Wayback Machine itself) don't like being barraged by requests, we can set an additional configuration option sleep
on the object we pass in to the request:
# Let's be nice and honor their desire to not get hit with a lot of requests.
"url_now": =Object({"params": {"url": "glideapps.com"<customer_url>}, "sleep": 2}})
"wayback_machine_today": =Request('get', "http://archive.org/wayback/available")
This will cause the model to pause (sleep) for 2
seconds before making the request. This helps to prevent response codes like 429
(Too Many Requests).
Example:
# Describe configuration.
"request_config": =Object({"sleep": 2, "timeout": 10, "cache_duration": 3600, "params": {...}})
# Define the request.
"my_request": =Request("get", "https://api.vendor.com/v1/docs")
# Pass the config to the request.
"request_config" -> "my_request"
Screenshots
If you'd like to take a screenshot of a webpage, use img
as the method:
=Request('img', 'https://stripe.com')
This will return a base64-encoded image/png
of the full contents of the web page specified by the second argument. This can be passed to a vision-aware AI like gpt-4o
or POST'd to a service that allows you to upload images.
Web Scraping
The request event can be used as a web scraper by defining scrape
rules (see below) in an Object that you pass to your request. The general pattern is a string defining where you'd like to store the content followed by a CSS selector or XPath to the element.
"request_options": =Object("scrape": {
"title" : "h1",
"subtitle" : "#subtitle",
})
"request": =Request("get", "https://stripe.com")
"request_options" -> "request"
This will extract the h1
text and return it under a "title" attribute, and the text contained inside a DOM element whose id
attribute is subtitle
and return it in an attribute called "subtitle."
Request configuration options
A full list of options that can be sent to Request
using a preceding Object
. The Option in the table is the object key.
Option | Purpose | Value | Default |
---|---|---|---|
params | Querystring arguments. | A dictionary of key-value pairs. | None |
headers | Request headers. | A dictionary of key-value pairs. | None |
json | JSON body arguments. | A dictionary of key-value pairs. | None |
cookies | Cookies to include. | A dictionary of key-value pairs. | None |
sleep | Place a pause before making the request. | Any number or decimal (seconds) | 0 |
timeout | Tell the request how long to wait before considering the request a failure. | Any number of seconds. | 10 |
cache_duration | Choose how long you'd like Summit to cache the response. Useful for higher performance when the underlying data doesn't change very often, or when you want to limit your unique requests to the endpoint. | Any number of seconds. | 60 |
scrape | A set of rules to extract HTML nodes (DOM entities) from a web page. | A dictionary of key-value pairs. | None |
Formats, File Size, and Timeout Limitations
Currently, JSON, CSV, and HTML are the only data formats (or content types) that can be retrieved by a
Request
. Other formats will be rejected and the output of the event will be set to empty.Both JSON and CSV responses will be converted to a JSON format on receipt to enable parsing using a
Parser
event. CSV files are automatically turned into a list of dictionaries (objects with key-value pairs) where the column header is the key.SEL also imposes a 5 MB limit on the size of the content that can be retrieved.
Parser
If you want to use the response of your request, you'll need to create a Parser
event.
These events take a JSONPath argument that defines a search to retrieve the value you want from the response data, like so:
=Parser('$.pricing.express_overnight')
The parser event itself will extract and pass along the values it finds using your expression as a list of values, but the value will not be stored in the parser event (it is not a container).
To operate on the list output of Parser
, you can use a Transform to, for example, sum the list, a Pool to store the list, or Matches to search the list.
JSONPath resources
SEL uses the
jsonpath-ng
Python module to search your data using this expression. You can read more about this module here: https://pypi.org/project/jsonpath-ng/Writing a complex JSONPath expression can be challenging. In addition to asking ChatGPT, this site is a handy way to test a JSONPath expression against a JSON object: https://jsonpath.com/.
Table
A lot of business processes involve the usage of data stored in a remotely-accessible CSV file. These files are often "published" spreadsheets (using a service like Google Sheets or Microsoft Excel) or even a read-only database view or "clip."
When you want to pull this data into SEL, you can use Table
to fetch and structure the data, and Query
to run ad hoc queries on it.
To pull a remote CSV file accessible by URL into a Summit, you can use:
=Table('https://docs.google.com/...'<csv_url>)
This will spin up a small database inside your model and store the table uniquely inside of it. In other words, you are welcome to use this event in multiple places, and the data will be stored in different tables:
=Table('https://data.heroku.com/clips/...'<data_url>)
In the SQL of downstream Query
events, you will refer to this table as simple $table
. For example: SELECT count(*) FROM $table
.
You may also wish to pass Table
a list of dictionaries, such as the output of a Parser
event. You can do this by leaving the Table
argument empty:
=Table()
In either case, you may provide an optional argument for the name of the table:
=Table("contacts")
Or
=Table("https://sheets.google.com/...", "contacts")
This will load the input data (or the data from the remote CSV) into a table in your model called contacts
.
This can be useful for writing a query in an event that is not directly downstream of the table.
Query
Once you have a data file stored inside a Table
, you'll most likely want to run a query on it. In SEL, we can do this by passing SQL to an event like so:
=Query("SELECT min(price) FROM $table WHERE status = 'ACTIVE'")
If we point a Table
at this event, we will see a value in the results table and ledger that answers this query.
If a single value is returned by a query, the value of the event will be that return value. Otherwise, the value of the event will be the number of rows returned by the query. Therefore, the above query will return the minimum price stored for ACTIVE
products in the table.
SELECT price FROM $table WHERE status = 'ACTIVE'
will, on the other hand, return the number of rows in the table.
To do something with the price
values returned by the above query, like sum them, find the average, or max, you should use the transform event which can take a list of values and reduce it to a single result.
If the Table
you are querying was given a name when it was defined, you should refer to it using its name, for example:
=Query("SELECT * FROM members WHERE age >= 65")
An example using Table
, Query
and Response Data.
#!/usesummit/sel/0.1a
"csv_table": =Table("https://docs.google.com/spreadsheets/d/e/2PACX-1vRg7CfZbL3V_h4fNzgl8KqTO7JKSuJ1NqJ03C3i4P6eK12jSDbOGPvTBPhzJMoQmkVEBLYUl2r5v5Sf/pub?gid=0&single=true&output=csv")
"csv_rows": =Query("SELECT * FROM $table LIMIT 2")
"my_csv_response": =Response("my_csv_data")
"csv_table" -> "csv_rows" -> "my_csv_response"
Multiline SQL
You can also write multi-line SQL queries by wrapping your query in """
:
#!/usesummit/sel/0.1a
"csv_table": =Table("https://docs.google.com/spreadsheets/d/e/2PACX-1vRg7CfZbL3V_h4fNzgl8KqTO7JKSuJ1NqJ03C3i4P6eK12jSDbOGPvTBPhzJMoQmkVEBLYUl2r5v5Sf/pub?gid=0&single=true&output=csv")
"csv_rows": =Query("""
SELECT * FROM $table LIMIT 2
""")
"my_csv_response": =Response("my_csv_data")
"csv_table" -> "csv_rows" -> "my_csv_response"
Summit SQL == SQLite
Summit uses in-memory SQLite databases for
Table
, so all of yourQuery
syntax should observe what's included in theSQLite
dialect.Learn more about SQLite's syntax and function library.
Updated about 10 hours ago