1. Your first visit to EDU.LEARN

    Welcome to EDU.LEARN

    The best way to learn a language is to use it. On EDU.LEARN you will find interesting texts and videos that give you exactly that opportunity. Don't worry - our videos have subtitles to help you understand them better. In addition, clicking on any word gives you its translation and correct pronunciation.

    No, thanks
  2. Mini lessons

    Context is very important when learning a language. Photos, usage examples, dialogues, audio recordings - all of these help you understand and remember new words and expressions. That's why we created Mini lessons: short lessons built from contextual slides that make your learning more effective. There are four types of Mini lessons - Grammar, Dialogues, Vocabulary, and Pictures.

    Next
  3. Videos

    Practice a foreign language by watching interesting videos. Choose a topic that interests you and a difficulty level, then click on a video. Don't worry - each one comes with subtitles. Or maybe you won't need them at all? Give it a try!

    Next
  4. Texts

    Read interesting articles to learn new words and find out more about the things that interest you. Just as with videos, you can choose a topic and a difficulty level, then click on the article you want. Our interactive dictionary will help you understand even difficult texts, and the context will make the vocabulary easier to remember. In addition, every article can be read aloud by a virtual narrator, so you also practice listening and pronunciation!

    Next
  5. Words

    Here you can find your "My words" list and the word search feature - and soon a thematic dictionary as well. You can add words to your "My words" list from the Videos and Texts sections. You can later review every word on the list in one of our exercises. You can also go to your list at any time to check a word's meaning, pronunciation, and usage in a sentence. Use our word search in the "Vocabulary" section to find words in our database.

    Next
  6. Text list

    This list of texts appears when you click "Texts". Choose a difficulty level and a topic, then the article that interests you. Once you are taken to it, click "Play" if you want it read aloud by the virtual narrator - a good way to practice your listening skills. Some texts are especially interesting - they have a badge in the top right corner. Be sure to read them!

    Next
  7. Video list

    This list of videos appears when you click "Videos". Just as with Texts, first choose a topic that interests you and a difficulty level, then click on the video you want. The ones with a badge in the top right corner are especially interesting - don't miss them!

    Next
  8. Thank you for taking the tour!

    Now you know all of EDU.LEARN's features! We have prepared plenty of articles, videos, and mini lessons for you - you are sure to find something that interests you!

    Now we invite you to sign up and discover everything the site has to offer.

    Thanks, I'll come back later
  9. Help list

    Need help with something? Check our list below:
    No, thanks

62,437 users are already learning foreign languages with Edustation.

You can sign up today and claim a bonus of 10 coins.

If you'd like to learn more about our site - click here

Not right now

or

Level:

All

Don't have an account?

Google I/O 2009 - Building Scalable, Complex Apps on App Engine


Level:

Topic: Media

Slatkin: Welcome to my talk.
This is "Building Scalable, Complex Apps on App Engine."
I am Brett Slatkin.
I am a software engineer on the Google App Engine team.
And...yeah. So let's just jump right in.
So the agenda-- here's the agenda.
So we're gonna talk about two subjects today
that are, I think, really interesting
for building App Engine apps.
The first one is list properties.
We're gonna talk about what they are,
how they work.
I'm gonna give you a concrete example
of microblogging
to illustrate why they're useful.
And then we're gonna talk about maximizing the performance
of list properties,
with some detail in there.
The second topic is merge-join.
We're gonna talk about what it is,
how it works, and list property magic--
some really cool things you can do with App Engine
that people don't realize is actually possible.
And the concrete example there that I'm going to show
is modeling the social graph.
And again, there's some other detail.
I didn't get hooked up on the Moderator thing
in time for this, but I have just set up my own here--
tinyurl.com/complextalk.
So please hop on, put in some questions,
if you're confused by anything,
vote up on anything,
hopefully that'll work.
All right, so let's-- let's go in here.
So I guess the one little thing I wanted to say
before all this was...
it's been a really exciting year
on App Engine, on the App Engine Team--
seeing all the developers do all these things,
seeing our developers wrap their heads around
all the crazy things that we ask you to do
with our datastore.
I am gonna make this even harder for you
to understand, maybe.
I'm gonna just throw you into the fire a little bit
because people seem to like that.
So if I'm going too fast or it's a little complicated,
please ask me a question--
especially on Moderator--
but hopefully it'll make sense.
But, you know, if it doesn't make sense,
hopefully I'll explain it in a later slide
and so, yeah, just bear with me
is what I'm trying to say.
Okay, so what is a list property?
So it's just another property in the datastore.
It has multiple values.
So it's an ordered list that maintains its order.
So when you write the entity or you retrieve the entity,
it comes back in the same order--
this property comes back as a list
in the same order that you wrote it.
Now, when you query, it uses an equals filter.
But the interesting thing is that when you do that query,
it matches any of the possible values in that list.
So the entity will match for any individual item
in that list.
And...and one of the things that's also funny
is that--because of the way that App Engine works,
the default sort order on list properties
is not very useful.
It's kind of complicated, and we'll get into why.
So you have to sometimes use composite indexes
if you want some distinct ordering
with list properties.
You'll understand what I mean in a minute.
But they're very easy to use.
I mean, here's an example right here.
I have a "Favorites" model class.
I have a list of colors,
and then I have the user that those favorites belong to.
If I want to assign a list property,
I just--it's a list; I just assign it.
That's it. That's all there is to it.
Now, why would you want to do this, you know?
Like, this is a many-to-one relationship
that I am modeling.
Why would you do this as opposed to
inserting a new entry in the database
for each--for each part of the one-to-many relationship?
Well, there's a few reasons.
First, you can densely pack information.
You can track lists of related items in a way
that's much more efficient in terms of space and CPU.
You could also use multiple lists in parallel
to store "tuples."
So two lists in parallel would be a pair.
Three would be a triple... so on.
It's also just kind of easier to use.
If you look at this example,
this would be a standard one-to-many relationship
in the datastore.
Here again I have the-- a single entity
representing a favorite color.
And I would write multiple of these for each user.
And then to find the--a list of somebody's favorite colors,
I would actually have to do a query
and retrieve the data.
So, you know, that's a lot more complicated
than just saying .colors.
But, more importantly, list properties are great
for answering set membership questions.
And this is where the queries come in.
So which colors-- like, here,
an example is which color is like--
which users like the color yellow?
And so here is an example.
I say, "Give me all favorite colors
where the favorite color is yellow."
It gives me back the results,
and then I can pull out the usernames.
It's a simple query to give me
some pretty interesting information
very easily.
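The matching rule described above — an equality filter on a list property matches if any item in the list equals the value — can be sketched in plain Python. This is an illustrative stand-in, not the App Engine SDK; the class and function names are mine.

```python
# Plain-Python sketch (not the App Engine SDK) of list property matching:
# an equality filter matches an entity if ANY item in the list equals
# the filter value.
class FavoriteColors:
    def __init__(self, user, colors):
        self.user = user      # who these favorites belong to
        self.colors = colors  # the list property: ordered, multi-valued

def query_colors_equal(entities, value):
    """Simulates: SELECT * FROM FavoriteColors WHERE colors = :value."""
    return [e for e in entities if value in e.colors]

entities = [
    FavoriteColors("alice", ["yellow", "red"]),
    FavoriteColors("bob", ["blue"]),
    FavoriteColors("carol", ["green", "yellow"]),
]

fans_of_yellow = [e.user for e in query_colors_equal(entities, "yellow")]
print(fans_of_yellow)  # ['alice', 'carol']
```

One query cuts across every entity's list, which is the "fan-out" capability discussed next.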
Now, this seems really simple,
but this actually has a great fan-out capability.
I can--I can cut across all of my data.
I can match kind of any-- any value of yellow
in users' lists of favorite colors
across all FavoriteColor entities.
So we'll get more into what "fan-out" means later.
But there's--there's some really useful things in here.
Now, why would you use list properties,
again, over the normal one-to-many relationship?
Well, first of all, each list item
only has an index entry.
So normally when you take a-- the App Engine datastore
and you write an entity,
you're writing into multiple indexes.
You write the entity itself into the entities table
and you write into the "by kind" index.
And you write into each of the different property indexes.
And you write into composite indexes.
And so there's a bunch of overhead
with actually storing that entity.
And with a list item,
you only actually add one index row.
That's it.
So--so where, you know, a normal entity, right,
would have, say, five, this would only have one.
So less storage.
You also save the space of the keys
of that entity in that index,
which also saves you more space.
And so, ultimately, you're just saving space.
And it's easier to understand.
It's just a list.
There are a couple of gotchas, though.
It doesn't come for free, you know.
It's all about trade-offs.
So, you know, list properties use more CPU
for serializing and deserializing
than--than doing it the other way.
When you access the entity,
you have to serialize and deserialize the whole list.
We'll get into what that means and why that's bad.
The other thing is, like I said,
the sort orders aren't very useful on their own.
So if you-- if you want to use
list properties and sort orders,
you need to use a composite index.
And you need to be careful that
you only use one list property
in a composite index
or else you get this "exploding" problem.
What is exploding and what is an exploding index?
So here's the idea:
You have a list property with 1,000 entries in it.
And then you have another list property
with 1,000 entries in it.
If you want to build a composite index on both,
you need the--the cross product of all of them.
So you actually need 1 million index entries
to represent all the possible combinations
of how those could pair up.
And that's-- that's the explosion
in storage space.
So watch out for that.
But there's a lot of stuff you can do that's useful
with creating on just one list property.
And I'll explain.
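The explosion is just the arithmetic of the cross product: a composite index over two list properties needs one row per combination of values, as this small sketch shows.

```python
# Sketch of why a composite index over TWO list properties "explodes":
# the index must materialize one row per combination of values.
from itertools import product

tags = [f"tag{i}" for i in range(1000)]        # first list property
regions = [f"region{i}" for i in range(1000)]  # second list property

# One index row per (tag, region) pair -> the full cross product.
index_rows = sum(1 for _ in product(tags, regions))
print(index_rows)  # 1000000
```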
So let's get into our concrete example.
I feel like every conference I've seen
in the last two years
has somebody explaining
how to make your favorite microblogging service scale.
For real, this time, on my platform,
I--I swear to you--
I guess I'm guilty of that.
But microblogging is actually a very interesting example.
So microblogging--
I mean stuff like Jaiku, Twitter, identi.ca,
some of the other ones out there.
And it's basically a very high speed feed
of small messages.
It represents the publish and subscribe use case.
It's also broadcast, multicast--
kind of the same thing.
The general idea is the user sends a message
and it goes to many other users...
a lot of users.
In some cases, millions of them.
And it's a great example of fan-out,
because one simple user action--
typing in your 140 characters, pushing return--
causes a lot of work to happen,
causes a lot of data that needs to be surfaced,
a lot of people that need to be notified,
and it's so easy to do
that you're constantly trying to catch up with this work.
That's why it's hard to scale this.
And that's just the nature of fan-out.
It's hard.
So let's talk about an example of fan-out,
kind of the naive way of doing it.
Now, fan-out is hard because it can be inefficient.
You can require duplicate data.
And so this is kind of the email example.
If you imagine that each of these users--
"B," "C," and "D"-- are different email users
on different email servers,
when SMTP wants to send a message,
what does it do?
Well, it sends a copy of the message
to each SMTP server.
So that's copying all the bandwidth, all the data.
Each SMTP server needs to manage it differently,
and all the users get their own copy.
And eventually they download it to their own machine,
and there's just a lot of duplicate work happening.
Now, this is kind of a testament
to why SMTP works so well.
It's scaled just fine.
But the throughput's kind of lower in some ways.
And the fan-out's usually pretty low.
People don't usually send emails to a million people.
They send it to 100 or a listserv.
A better way to do fan-out is like this.
You know, not duplicate any data.
So ideally we'd kind of do--
do message sending by reference.
We'd have user "A" send a message.
The message would be-- have all of the different...
users who should get it annotated on it,
And then they would just receive it by reference.
We would be able to figure out, "Oh, here are the messages
that user 'B' should receive."
So let's talk about how to do that model
using a traditional relational database.
This is probably what it would look like.
There's probably a better way to do it,
but this is the first thing that I think would come to mind.
You have a table of users.
You have a table of messages.
Then you have the table of messages that users should get.
So each user has an I.D. and some properties--
their name, et cetera.
You have a table of the messages,
the body of the message, title,
time it was sent, et cetera.
And then you have a-- the relationship
between these two primary I.D.s,
which is the join table.
So how would you do this in SQL?
Well, if you want to find messages for the user "X,"
you would do a join.
You'd select from messages all the user messages--
select from messages all messages
with a "join" on the UserMessages table
using message_id to join the two tables together.
So that's basically connecting the dots.
And then you would say,
"Now only give me the ones for user 'X.'"
And that's how you get that list.
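The relational version of that schema and join can be run end to end with sqlite3; the table and column names here are my guesses at what the slide shows, not a quoted schema.

```python
# Runnable sqlite3 sketch of the relational design: Users, Messages, and
# a UserMessages join table. Table and column names are illustrative.
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE Users (user_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE Messages (message_id INTEGER PRIMARY KEY, body TEXT);
    CREATE TABLE UserMessages (user_id INTEGER, message_id INTEGER);
""")
db.executemany("INSERT INTO Users VALUES (?, ?)", [(1, "X"), (2, "Y")])
db.executemany("INSERT INTO Messages VALUES (?, ?)",
               [(10, "hello"), (11, "world")])
# User X should receive both messages; user Y only the first.
db.executemany("INSERT INTO UserMessages VALUES (?, ?)",
               [(1, 10), (1, 11), (2, 10)])

# "Find messages for user X": join Messages with the join table.
rows = db.execute("""
    SELECT m.body FROM Messages m
    JOIN UserMessages um ON um.message_id = m.message_id
    WHERE um.user_id = 1
""").fetchall()
print([body for (body,) in rows])  # ['hello', 'world']
```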
But there are no joins on App Engine.
So how do you do this?
How would you implement this on App Engine?
The answer is list properties.
They--they come to our rescue here.
So with App Engine,
I define a simple message class.
This is a-- just a DB model.
I'm doing this in Python just 'cause it's a little shorter.
I say, "Here's my sender."
I have a list of receivers.
These are the people that should get my messages
by reference.
Then I have a body, which is the actual body of the message.
Now, to figure out what messages I get,
I just do a simple query.
I say, "Give me all the messages where I'm a receiver."
Seems pretty natural. There are no joins.
It's just--it's just done.
And this is essentially how Jaiku works
at its core--
this simple idea of doing fan-out
with list properties.
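The talk's Python model uses App Engine's `db` module; a plain-Python stand-in makes the semantics easy to check without the SDK. The point is that the inbox query is just an equality filter on the receivers list, with no join anywhere.

```python
# Plain-Python stand-in for the Message model: fan-out by reference.
# Each message carries its receivers as a list property, and the inbox
# query is a single equality filter on that list.
class Message:
    def __init__(self, sender, receivers, body):
        self.sender = sender
        self.receivers = receivers  # list property: who gets this message
        self.body = body

def inbox(messages, user):
    """Simulates: SELECT * FROM Message WHERE receivers = :user."""
    return [m for m in messages if user in m.receivers]

messages = [
    Message("A", ["B", "C", "D"], "hi everyone"),
    Message("C", ["A"], "just for A"),
]
print([m.body for m in inbox(messages, "B")])  # ['hi everyone']
```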
Now, with JDO you can do the same thing.
Here, again,
you have the autogenerated I.D. of the message.
You have a sender, you have the body,
and you have a list of receivers.
Exactly the same idea.
To do the query-- I'm sorry.
Everyone get it?
All right.
To do the query you do exactly the same thing.
It's a list property with an equals filter.
You get your PersistenceManager,
you create your query on the Message.class,
you filter on the receivers using an equals filter,
and then you get back a message.
And at that point, results.receivers
is the list of receivers that you want to--to access.
Okay, so let's have a-- let's have a go.
Let me show you the actual-- let me show you it in action.
So this is a little app I wrote
called "Publish-subscribe test."
The source code's available online.
I have a link to it later.
And this lets me demonstrate how this works.
So to make this simple, I just made it
so that you specify the number of receivers you want.
And it'll go and send that message
to that many receivers.
So here I can have my nickname: Brett.
I want to send it to, say, 100 receivers,
and say, "Hi from I/O."
Then I submit. Great. Message sent.
So now if I look at Message 1,
I can retrieve it.
And then I have a list of them here.
I think--I think I have too many examples in here.
I didn't do paging yet, but it's in there.
And I'll get to why I have all these--
all this demo data in just a sec.
Right. So pretty simple.
Really easy. Really simple query.
Easy to do writes.
So let's talk about the performance of this, though.
So index writes are done in parallel on BigTable.
That means that it's fast. It's really fast.
You can update a list property with 1,000 items.
That'll actually do 1,000 different writes
to BigTable simultaneously in parallel,
which is-- that's a lot of throughput.
The write time complexity--the CPU cost, the latency--
it scales linearly with the number of items.
And right now we limit--
we limit you to 5,000 indexed properties per entity
kind of for safety.
And the storage cost is about the same
as a relational database system.
In a relational database,
you'd have that user messages table.
You'd have the user key and you'd have the message key
in the join table,
and that would be your cost of sending a message.
You'd have to write different rows for each person
that needs to receive the message.
In the datastore, it's exactly the same way.
You have the entity key
and then you have the list property value--
the person who should receive it.
And those are the only things that you need to store.
So it's just primary keys.
But there's a downside-- I kind of hinted at it--
Serialization.
So the problem is that
writes have to pack your list
into one serialized protocol buffer.
Protocol buffers are an internal tool--
it's now external, available for you.
It's a very quick serialization
marshaling kind of protocol.
And it's really fast, and it's space efficient.
But even so, if you have 1,000, 2,000 items,
that packaging can take a little while.
Now, for writes this is okay
because writes are relatively infrequent.
They're a lot less frequent than reads.
The killer here is that
queries also have to unpackage all the result entities.
So if you remember our message class--
I'm gonna zip back here.
Our message class has a list of receivers.
If you have 1,000 receivers in there,
that's gonna take a little while
if you receive just one.
Now, what if you want to take 100 messages?
Then you have to deserialize 100 times 1,000--
100,000 different list property values--
from your entities.
And that serialization cost gets pretty high.
So we've got a problem.
And it's-- it uses a lot of CPU.
It uses a lot of latency, datastore latency,
so end user latency is pretty high.
It's--it's-- it's just not very good.
And so now you're saying,
"Well, why are you telling me about all this stuff
if I can't use it," right?
Well, luckily, I have a solution.
But we'll get to it.
So really, when I query for messages,
I only really want the message information.
I don't really care about the index
that got me the information after the fact.
The problem with App Engine-- we'll get into it a bit more--
is that, you know, you're carrying your indexes with you
because you're using a denormalized schema
a lot of the time.
The receivers list is a denormalized index.
And when you retrieve an entity,
you pull that whole index down also.
But we don't really care about the list property
after we query.
We, you know, we only care about the message body,
the sender, and metainformation.
So what if we could selectively skip certain properties
when we're querying?
So, for instance, the receivers list--
just don't give it back to me.
Then we could avoid the serialization cost.
Then we'd just get the message bodies that we care about.
And ideally we could do something like what SQL does,
where we can just say, "Hey," you know,
"give me only these columns of the table."
But you can't do that.
We don't have support for that yet.
Still trying to figure out how that works
with our whole all-around scheme in the DB module.
But we have a solution.
And that's something called the relation index entity.
This is kind of a new thing.
I don't--I don't know what the best name for it is,
but I'm calling it a relational index entity.
So the idea is really simple.
Let's split the message into two entities, okay?
The first one is the message model.
That contains what we care about.
It has the body, and the sender,
and the metainformation that we actually want to access
once we've done the query.
Then we have another one, which is the MessageIndex.
And the MessageIndex just has
the fields we want to query on.
Nothing else. So here they are.
You can see, I just split it in half.
That's all I did.
Now, the key thing
is that we put them in the same entity group
so that we can transactionally operate on both of them
at the same time.
And we make the MessageIndex a child of the message.
What that means is that the MessageIndex primary key
implicitly points to the-- to the messages--
to its parents' primary key.
In the App Engine datastore,
all primary keys include the full path
of--of that entity and its parent elements.
So...so in this case,
if we had just the MessageIndex,
I know who its parent is just by looking at the key.
All I need is the key.
So a new thing that we just launched--
I think it was in our 1.2.2 launch
just a couple of weeks ago--
was something called the key-only query.
And a key-only query lets me do...
a key-only fetch on the message index type.
What does that mean?
So I want to do a query and only get back the keys
that match.
Then I want to transform those keys
to de-reference the pointer to their parents, essentially.
And then I want to fetch the actual messages
that I care about in batch, very quickly.
And so here's an example of that.
I do a GQL query. I say,
"Select the key from the MessageIndex."
Just show me the matching entities.
Don't pull down the whole MessageIndex entity.
Don't--don't deserialize the list properties.
Just avoid it. Just give me the keys.
And then, you see, on the next line I say
keys = k.parent.
That traverses the key to give me the parent entity,
which is a message.
And then I say db.get(keys)
and that gets all of those keys in parallel.
Now, under the covers,
this is what App Engine's datastore is doing anyway.
This is--this is-- this is how we work.
We--we scan one index,
we figure out all the keys you need,
and then we grab them all from the actual entities table
in parallel.
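The three steps just described — key-only query on the index child, hop to the parent key, batch-get the parents — can be sketched in plain Python. The dictionaries stand in for the datastore, and the tuple key stands in for App Engine's parent-embedding entity key; none of this is SDK code.

```python
# Sketch of the relation index entity pattern: the heavy receivers list
# lives only in a child "index" entity whose key embeds its parent's
# key, so a key-only query plus a parent hop fetches messages without
# ever deserializing the lists. Plain Python, not the App Engine SDK.
messages = {  # parent entities: key -> the fields we actually want
    "msg1": {"sender": "A", "body": "hi from I/O"},
    "msg2": {"sender": "C", "body": "lunch?"},
}
message_index = {  # child entities: (parent_key, child_id) -> query fields
    ("msg1", "idx"): {"receivers": ["B", "C", "D"]},
    ("msg2", "idx"): {"receivers": ["A", "B"]},
}

def key_only_query(user):
    """Simulates: SELECT __key__ FROM MessageIndex WHERE receivers = :user."""
    return [key for key, e in message_index.items()
            if user in e["receivers"]]

# Key-only query, then hop to each parent, then a batch get.
keys = [index_key[0] for index_key in key_only_query("B")]  # key.parent()
inbox = [messages[k] for k in keys]                         # db.get(keys)
print(sorted(m["body"] for m in inbox))
```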
Ryan Barrett's talk
on "Under the Covers of the App Engine Datastore"
gets into some-- from last year
gets into some really interesting detail
about how that works.
But hopefully you can see the efficiency
we're gonna get from this,
because we've completely avoided
all of the list property deserialization.
We can-- we can just figure out
what--what entities we need to get,
go and retrieve them,
and then-- and then pull down
the part of the message we actually care about.
So I--I have this written up also in that same demo.
And what's cool is I have a little check box
to do either-- either of the examples.
So I have a little timing box I can click on.
Then I click "Retrieve".
So you'll see this is retrieving, I think,
ten list property entities,
ten message entities.
Each one has 2,000 receivers in it.
And it takes 1.3 seconds.
That is a really long time. That is way too long.
1.6-- so it's got some variance.
It's--it's way too slow.
That's just for ten.
But if I change this to use the key-only ones--
again, I have ten entities in there--
oh, need timing on.
You see it's a lot faster.
It's--it's ten times faster.
It's--it's sometimes faster than ten times faster.
It's--it's way better.
And this is because I'm avoiding all the serialization costs
associated with serializing and deserializing
those list properties.
So what's the conclusion here?
Relation index entities
give us much better performance.
We have the same cost for doing writes
but reads are ten times faster and cheaper.
We have--ten times faster and cheaper
than the old way of doing it,
where we had to do deserialization.
And we have the best of both worlds with list properties.
We have very low storage costs
because we only write one index row
for each list value
and the low CPU cost
because the queries are really easy to do.
So this is a way of doing fan-out.
It's a very efficient way of doing fan-out.
Now, what if we want to keep going further?
What if we need more fan-out?
I said there was a limit of 5,000 entit--
5,000 index rows.
Well, in that case, if we need more indexes,
you can just write multiple index--
relation index entities per message.
There's nothing stopping you
from having multiple child entities
that refer back to their parent.
You can add these index entities in the background
using our new Task Queue API
which I'll talk about tomorrow.
And I think this provides a solution
to the million-fan-out problem.
I think that the million-fan-out problem
is kind of a really hard problem
that people have been talking and talking about
how to solve for a while.
And I think this'll let you do it.
I think the other thing that's cool about this
is that you don't have to migrate your schema to do it.
If you say, "Hey, I need an index on 'X'"
suddenly for this user,
just go through and write a new relation index
for each of your users as a child element.
That's all you have to do.
You don't have to change any of your schema.
They're just totally decoupled--
your indexes and your data.
So the final picture of what you'll have
is you'll have the message, which is what you care about,
and then just a series of indexes
written as child entities.
And you use those child entities as references
to the parent to find the parent,
but then don't-- you don't deserialize them ever.
And--and what you're doing here really
is kind of accessing
a lower-level BigTable-like interface
of doing row range and index scans.
So if you have a problem doing fan-out,
doing set membership fan-out,
I encourage you to look at list properties.
Okay.
This going okay so far?
People with me? Yeah?
I see some nods. That's good.
Okay, so merge-join.
Merge-join is a funny word.
Merge-join is-- is a--is the secret sauce
that I don't think people realize is there.
So what is it?
Well, people say, "Oh, App Engine,
"I'm not gonna use App Engine.
App Engine doesn't support joins."
Well, that's not true, actually.
We don't support natural joins
or inner joins or outer joins.
Those are useful joins. We don't support those.
But we do support merge-joins.
A merge-join is a self-join.
We can join tables with its--
we can join a table with itself.
And what that means is you can combine
multiple equality filters--
multiple equality tests in a single query.
And so what that lets you do
is determine Venn-diagram-like overlaps in sets.
So an example is something very useful
for your business application:
I want to see the overlap of spots, horns, and legs.
Obviously that's a cow-- four legs.
Now, if I want to do a query to figure out this data
in some taxonomy of animals,
merge-join is what you can use to do it.
I'll show you why.
But why would you want to use merge-join?
Well, it's great for exploring your data,
data mining-like operations.
There's--the practical limit of equality tests
is pretty high.
You can have ten or more filters pretty easily.
And you don't have to build indexes in advance
to do these queries.
So if you're doing a Venn-diagram-like query,
you don't even have to build an index.
You can just do it right now.
You can--you can even do it from the shell application
right in your deployed App Engine app.
And that's--that's great for doing ad hoc queries
if you want to figure out something.
It also reduces cost, 'cause indexes take space,
indexes take time to build.
And you can actually provide
some pretty advanced functionality
with Venn-diagram-like operations.
So an example is what you have in Gmail.
Think of a Gmail filter
or a Gmail search that you can do.
You have various labels applied,
read/unread status, a month, year, day,
number of replies, recipients, et cetera.
These are all properties of a message.
If you think about it,
these are all just Venn-diagram-like overlaps
in sets.
All you're doing is testing set membership.
But you're doing those-- those tests in parallel,
together, in a single filter.
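Conceptually, each equality filter selects a set of entity keys, and combining filters returns the overlap of those sets. A minimal sketch, with made-up posting lists standing in for property indexes:

```python
# Sketch of "multiple equality filters = set intersection": each filter
# picks out a set of entity keys, and the combined query is the overlap,
# much like stacking Gmail search terms. Data is illustrative.
index = {  # (property, value) -> keys of matching messages
    ("label", "work"):    ["m1", "m2", "m4"],
    ("status", "unread"): ["m2", "m3", "m4"],
    ("year", 2009):       ["m1", "m2", "m4", "m5"],
}

def merge_join(filters):
    """Intersect the posting lists for every (property, value) filter."""
    result = None
    for f in filters:
        keys = set(index[f])
        result = keys if result is None else result & keys
    return sorted(result)

print(merge_join([("label", "work"), ("status", "unread"), ("year", 2009)]))
# ['m2', 'm4']
```

The datastore gets the same answer without building sets in memory, using the zig-zag scan described next.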
So what's an example of merge-join?
Well, here's my very useful animal class.
I have a list of what it has--
in this case, horns.
The color is spotted.
And it has four legs.
And if I want to do a query to find animals like this animal
I--I can just do this query:
SELECT * FROM ANIMAL
where the color is spots, has horns, legs 4.
Now, that doesn't look like a join.
But it is if you think about it.
And I'll show you what I mean.
Well, I'll show you what I mean in a second
and why it's relevant.
But let's talk about
how this actually works in App Engine.
So...so--well, first let's talk about how
it works in a relational database.
Well, the relational database usually--
the query planner has histograms.
The query planner knows that
I have this many entity--
this many records that have spots in this value.
I have this many records that have horns in this value.
I have this many records that have four legs.
And then it determines a query plan
to do the smallest--
to scan through the smallest result set.
And then it starts just either going through memory,
or through B-trees on disk,
to go and actually do a linear scan
through all of your entities
and then find the matching one.
And that's--that's how-- that's how SQL's doing it.
So...so how do we do that?
We don't have histograms.
Well, first of all, it's not available in BigTable.
There are similar optimizations
that work like this in other DB systems.
But this is some special sauce that we have,
we've had since launch.
So, first of all, we store all property indexes
in sorted order, ascending order.
If you know anything about BigTable,
that should sound familiar
because that's what BigTable does.
So it's-- everything is stored
in lexical order, ascending order.
And then, the datastore
essentially does a merge-sort at runtime.
So we take this sorted list
and we merge it with itself in different locations.
And we use a zig-zag algorithm to make this efficient,
to efficiently join tables.
And what this lets us do is scan a single BigTable index
in parallel, and it's so easy, and--
Yeah. This is really hard.
This is not a Google interview question,
but it sure sounds like one.
And I'm not going to be able to explain this
without actually showing you how it works.
So let me show you how it works.
I can talk about "zig-zag" and merge
and scans and all this stuff,
but you just need to see it to understand it.
So here's-- let's say this is
a BigTable index, okay?
These are--the tables represent property indexes.
Now, this is in ascending order.
You'll see that, for each property,
I have its value: color=red.
And then I have the primary key
of the entity.
So... color equals red ants,
spotted bear, spotted cow,
white dog.
And these are different sections of the same BigTable,
effectively, which is the animal--
the animal property table.
So the first thing I do--
if you go back to the GQL Query--
I'm gonna do--I'm doing three separate equality filters.
I'm doing one on color, has, and legs.
So first I-- so I start with a color.
Then I scan through BigTable to the row
that matches spots.
Then I say, "Okay, that sounds great."
I found spots. I found the first part of my filter.
So this part-- this equality--
this equality test has been matched.
So now it moves on to the next one,
which is has.
And then it scans through and goes, "Ah!
Now I found this key with horns that also matches."
But what's important here,
if you look for number one,
the key I have matched is bear.
For number two, the key I have matched is cow.
Now cow is lexically, alphabetically after bear,
which means that bear is wrong.
We need to actually find the next thing after bear.
But we need to find the next thing after bear
but before cow or equal to cow.
So this is the "zig." We move number one.
We say, "Okay, well, these keys don't match."
Bear and cow don't match.
So we'll start moving the window on number one
so we match cow.
We need those keys to match
because if we don't have a matching key,
we don't have a matching result.
So now we're satisfied again with two filters.
We have color equals spots, key is cow,
has is horns, key is cow.
We've--we've satisfied these two equality tests.
But now we move on to the next one.
We start off with legs=4.
The first thing-- result we get is a cat.
Again, the key does not match.
Lexically, the key cat is before cow.
So again, we do a BigTable scan
to scan after this-- this row
but before a--a cow,
which are the other results we have.
And that's the "zag."
And that moves--that moves the--that iterator down.
And now we have a match.
We have--the key-- the keys all are the same
and all of the equality tests are satisfied.
So we've joined this table with itself.
Okay. Does that make sense?
Cool.
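The walkthrough above can be sketched in plain Python. This is a hypothetical, in-memory stand-in for the per-property BigTable indexes, not the actual datastore implementation: each index is a sorted list of (value, entity key) rows, and the join repeatedly "zigs" the lagging scanners forward past the largest key seen so far until all keys agree.

```python
from bisect import bisect_left

# Hypothetical in-memory stand-in for the per-property indexes on the
# slides: sorted (value, entity_key) rows for color, has, and legs.
INDEXES = {
    "color": [("red", "ant"), ("spotted", "bear"),
              ("spotted", "cow"), ("white", "dog")],
    "has":   [("horns", "cow"), ("spots", "ant"),
              ("spots", "bear"), ("spots", "cow")],
    "legs":  [(4, "ant"), (4, "bear"), (4, "cat"), (4, "cow"), (4, "dog")],
}

def zigzag_join(filters):
    """Yield entity keys satisfying every (property, value) equality filter.

    Whenever the scanners' keys disagree, each lagging scanner seeks
    forward ("zig") to the first key >= the largest candidate key.
    """
    # Sorted key list per filter (already in key order within one value).
    rows = [[k for v, k in INDEXES[prop] if v == value]
            for prop, value in filters]
    positions = [0] * len(rows)
    while all(p < len(r) for p, r in zip(positions, rows)):
        candidate = max(r[p] for p, r in zip(positions, rows))
        matched = True
        for i, r in enumerate(rows):
            # Zig: advance this scanner to the first key >= candidate.
            positions[i] = bisect_left(r, candidate, positions[i])
            if positions[i] >= len(r) or r[positions[i]] != candidate:
                matched = False
        if matched:
            yield candidate
            positions = [p + 1 for p in positions]

# cow is the only animal matching all three filters.
print(list(zigzag_join([("color", "spotted"), ("has", "horns"),
                        ("legs", 4)])))
```

Note how, as in the talk, the spotted scanner first lands on bear, the horns scanner on cow, and the mismatch forces the other scanners forward until every scanner sits on cow.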
So I like zig-zag. I think it's pretty cool.
Good name. Good word.
So let's go with the concrete example.
What would you actually use this for?
So let's talk about modeling the social graph.
So a social graph is just a good example for this.
You can model all kinds of graphs.
Graphs are very useful for doing all kinds of things.
So anything you can represent with nodes
and relationships between nodes, vertices, whatever,
you can-- you can do this way.
So a social-- the social graph's useful
because the questions are easy.
So, you know, each user has a profile
and a set of friends.
We're gonna use merge-join on list properties
to do some magic for us.
So merge-join lets us answer all these great questions
about relationships.
First of all, who are my friends?
Who are my friends in San Francisco?
Which friends do I have in common
with somebody else?
Which friends do I have in common with somebody else
in San Francisco--
where somebody else is a specific person?
And if you look at a lot of social sites out there,
these are the queries you see.
These are the-- you know, the--
on the main profile pages and dashboards.
They say, "Hey," you know,
"you have these friends in common," and so and so.
Now, for simplicity I'm just gonna do
relationships two-way.
There's no reason you couldn't use this for a DAG.
It's just easier this way.
So...
here's my group of friends.
These are the people:
Mel, Bob, Stu, Willie, Lenny, Carl.
Maybe the names are familiar.
They live in three locations.
The lines are the friendships between them.
Now, if we want to answer a query like:
Who's a mutual friend of Willie and Carl?
That's easy. It's Lenny.
There he is.
You see that relationship with the arrows.
Similarly, if we want to figure out
who's a mutual friend of Bob and Willie in San Francisco,
we know that's Stu.
Now, you see how this is a Venn-diagram-like query.
That's what we're figuring out.
But we're operating on a graph.
Now, how would you do this with a relational system?
You'd have a person table,
which has the user and their location,
and then you'd have a table of friends.
You'd normalize it somehow
to make sure you don't have duplicates,
but you'd have User "A" and User "B"
and their friends.
And then to-- so then simple queries like:
Who--who are the friends of user "X"?
You'd just do a simple join
between the users table and the friends table.
You say, "Give me all the users,
"join on friends,
where the user I.D. matches the friend I.D."
And--and so here,
the first--the first part I have, user_b_id,
that's the join property.
And then the constraint is user_a equals "X" equals me.
So show me all my friends.
Show me all the friends of user "X."
And if you want to filter by location
you just add another filter to the users table.
And that gets more complicated
when you want to do a graph query.
If you want to do a graph query, now,
you want to figure out:
Give me all the friends common to "X" and "Y."
So you SELECT * FROM Users,
INNER JOIN on Friends 1 and Friends 2--
'cause you're joining on the same table
so you need two representations of that table.
You need to figure out where
the user_id equals f1 user I.D.
and the other user_id equals f2 user I.D.
And then you also need constraints
to make sure that you have user "X" and user "Y"
and then an actual merge that actually does the join,
which is that last line.
It's pretty complicated.
But if you're familiar with SQL, that--that works.
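For reference, here is a runnable sketch of that relational approach using sqlite3. The schema and the friendship data are hypothetical reconstructions of the slide, with each friendship stored in both directions for simple lookups.

```python
import sqlite3

# A minimal, hypothetical version of the relational schema from the talk:
# a users table plus a friends join table.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE users (user_id TEXT PRIMARY KEY, location TEXT);
    CREATE TABLE friends (user_a_id TEXT, user_b_id TEXT);
""")
con.executemany("INSERT INTO users VALUES (?, ?)", [
    ("Mel", "SEA"), ("Bob", "SF"), ("Stu", "SF"),
    ("Willie", "SF"), ("Lenny", "NY"), ("Carl", "NY"),
])
edges = [("Willie", "Lenny"), ("Carl", "Lenny"), ("Bob", "Stu"),
         ("Willie", "Stu"), ("Willie", "Bob")]
# Store each friendship in both directions so lookups are simple.
con.executemany("INSERT INTO friends VALUES (?, ?)",
                edges + [(b, a) for a, b in edges])

# Friends common to "X" and "Y": join the friends table against itself
# (two representations, f1 and f2), matching on the shared friend's id.
common = con.execute("""
    SELECT u.user_id FROM users u
    INNER JOIN friends f1 ON u.user_id = f1.user_b_id
    INNER JOIN friends f2 ON u.user_id = f2.user_b_id
    WHERE f1.user_a_id = ? AND f2.user_a_id = ?
""", ("Willie", "Carl")).fetchall()
print([row[0] for row in common])  # Lenny is the mutual friend
```

Adding the location filter is just one more `AND u.location = ?` constraint on the users table.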
But we don't have inner joins on App Engine
like we're using here.
And--and we don't, you know,
and people have said we don't have joins.
Well, we do have merge-join.
We can do self-joins.
The only really important part of this is the last line,
which is joining the two tables together.
So here's how we do it in App Engine.
We have a person, they have a location,
and they have a list of friends.
And we do a really simple query.
So it's very natural sounding.
It's: Give me all the people
where they have this friend and they have that friend
in San Francisco.
That's it.
That answers--that answers the question.
It'll de-dup.
It'll give you unique result sets.
It'll do that merge-join query
and give it to you.
And you can add as many filters as you need
up to a practical limit.
So let's demo this.
I'm gonna kind of go back
so you can actually see these examples.
So...
I have two-- two things here.
So I'm-- let's say I'm Carl,
and I want to see...
and I'm looking at Willie.
And I want to say:
Who are the friends who are common
to...Carl and Willie?
So you remember...
Carl and Willie, right?
Who are the friends common to Carl and Willie?
And it's Lenny.
Oh!
So Carl, Willie-- Lenny is a common friend.
So there we go.
Pretty easy to--to parse.
Similarly, you can see the friends Lenny and Stu
as being Willie's friends.
Now the next one was answering:
Who is a mutual friend of Bob and Willie in San Francisco?
And here I have this tab.
So...friends in common to Bob and Willie
in San Francisco.
And this application will let you
model these relationships
and do all the various queries.
And it shows all the different queries you can do
with a social graph.
And so you can use this
kind of as an example app
for how to do merge-join where merge-join is useful.
So what's the performance?
Obviously this can't come for free, right?
So merge-join scales with the number of filters you have
and the size of the result set.
The result set--
so that means it's best for queries
with fewer results,
say, less than 100, less than a few hundred.
If you have a lot of overlapping data
or a lot of results that could potentially match the merge-join,
that could be problematic.
But usually that's okay
because a lot of these queries-- you're trying to find overlaps.
You're trying to add more and more equality filters,
trying to narrow that result set.
You're looking for a needle in a haystack.
This is really useful for answering those questions.
And the great thing is
it has similar performance as list properties.
So it has the same read/write speed.
There's no storage overhead.
There are no extra indexes to write.
And exactly as with list properties,
you can avoid all the serialization costs
by using relation indexes.
So you can do all this for exactly the same costs.
Now, there are some gotchas.
You need to watch out for pathological datasets.
Too many overlapping values causes a lot of zig-zagging,
like I said.
That's, effectively, where
we have to keep scanning through the keys to find matches,
and we have to keep going
between our two different windows in the merge-join
to find an overlapping set.
I think that if your result set's small,
then the chances of this are very low.
But you should go back through this, you know, merge-join
and just think through, you know,
what it's doing to understand the potential for this.
Effectively, it's-- when you have to keep scanning
and scanning and scanning inch by inch
to find matches, back and forth.
Another thing that's really important
is that this doesn't work with composite indexes.
So the--the exploding index combinations
will keep this from working.
And that's because when you have multiple equals filters--
we're using merge-join to do the merge.
Now, the second you apply a sort order,
or you have any kind of inequality filter
you need to add to this,
it breaks down, because we need to--
we need to store, in our index, the--
the cross product of all of those list properties.
So...so that gets really, really big
really fast, like I said, and it won't work.
What that means is you can't apply sort orders
to some things that are used this way.
Now, that's--that's not-- that's not good.
But there's a couple of things you can do.
So, you know, you can sort in memory.
For a small result set, that's okay,
especially if you use relation indexes.
The--the actual size of that data is very low
and it's cacheable.
The other thing is that, if you remember
from the relation index code that I showed you,
first we got all the keys and then we transformed the keys
and then we retrieved the actual entities we care about.
There's no reason you can't do some filtering up front.
If there's something that you know about your keys
that you can filter them further,
you can do that filtering way in advance
and, you know, filter at the key level
before you even retrieve the--the entities.
So that--that's another way of optimizing this.
But...it's really-- it's really best
for membership tests, Venn diagram tests
where the sorting is done after the fact.
Okay.
So we're gonna do a little bit of wrap-up
and then we're gonna get to some questions.
So I encourage you to use list properties,
use merge-join for many things.
I believe that this can solve fan-out for your application.
It'll let you scale fan-out problems very well,
especially when combined with our Task Queue API.
This'll let you, you know,
have a large set of entities
that match a certain index
without a problem.
You can use this same idea to match geospatial info.
So instead of a list of my friends,
I could have a list of coordinates,
or a list of geographical regions,
or bounding boxes on a globe,
and do queries on that.
Again, these are set membership problems.
Now if I add merge-join to the party,
I can do relationship graphs.
I can do a bunch of graph operations
that are very interesting.
I could also do "fuzzy" values.
So if you have any-- any, you know, fuzzy values
that you need to store,
you can do that with list properties.
For instance, I can store today's date.
I can store the month, today's date,
or I can store the year, the month, the day,
today's date, you know,
the hour, the minute, the second--
I can store a whole list
of all these different resolutions of data
and then query on any one of them
to get a different subsection of the data.
So a lot of people, when they do a query on time,
they say, "Oh, give me everything
between these two times."
And that's not the right way to think about this.
You need to think about converting your queries
into set membership tests.
So compute those memberships at write time
and you'll enjoy very fast reads.
If you want to find everything from today,
just mark it as being from today.
If you want to find everything from this hour,
when you write the--
when you write an entity to the data store,
mark it as being from today
at this hour.
And then all you have to do to retrieve the data
is a single equality filter.
That's it. There's no--
There's no inequality filters required.
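This "fuzzy values" idea can be sketched as a small write-time helper. The token format here is an assumption for illustration; on App Engine the resulting list would be stored in a `StringListProperty` and each token later matched with a single equality filter.

```python
from datetime import datetime

def date_tokens(dt):
    """Precompute, at write time, one membership token per resolution.

    Stored in a list property, so "everything from today" or "from this
    hour" is one equality filter instead of an inequality range query.
    """
    return [
        dt.strftime("%Y"),              # everything from this year
        dt.strftime("%Y-%m"),           # ... this month
        dt.strftime("%Y-%m-%d"),        # ... today
        dt.strftime("%Y-%m-%d %H"),     # ... this hour
        dt.strftime("%Y-%m-%d %H:%M"),  # ... this minute
    ]

entity = {"body": "hello", "when": date_tokens(datetime(2009, 5, 28, 10, 30))}
# Read side: one equality test, no inequality filters.
print("2009-05-28 10" in entity["when"])  # True: written during that hour
```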
If you want to see the demos,
I have put them up available with source code.
This is hopefully useful.
It takes a little while to get them up otherwise--
pubsub-test.appspot.com
and dagpeople.appspot.com.
It's Python code.
And you can do all these same things in JDO.
And we're working on further optimizations for JDO,
but it's all functionally equivalent.
There's more information on our site:
code.google.com/appengine.
So now let's go to some questions.
Let me load this up.
[applause]
Thanks.
So yeah, there are mics here and here
you can come up to.
Yeah, cool.
So I'll just start off with some of these
and then, if you have any other questions,
please step up to the mic.
And if you want to duck out, thanks for coming.
So...
"What about ReferenceProperty and ReferenceListProperty?
"Are they as efficient as StringListProperty?
Can we apply the same practices you described in your talk?"
Yes, you can. You can.
The only reason you don't want to do this, necessarily,
is because ReferenceProperties include keys.
And keys can be bulky sometimes.
It has the kind, the app I.D.,
and the primary I.D. or name of the entity
in that key.
So in terms of actual storage space,
you know, a key might be, you know, 50 bytes
and a single string might be 10.
So it depends on how hardcore
you're trying to do your optimizations.
But, yeah, you can use ReferenceProperties
if you want to do it that way.
So...
"How well does list property scale?
"How many entries can we put in it
and not suffer in performance?"
Hopefully I answered that one.
For writes, you're gonna pay a cost
to do that serialization,
to put it in the datastore.
But to retrieve it back,
you can avoid that serialization cost
and overhead.
Yeah.
man: I wanted to ask...
in merge-join, about the WHERE clause.
You explained how the zig-zag...
algorithm works.
So it seems that it matters,
the order of the clauses inside the WHERE clause,
for the zig-zag to work.
And hence it affects the performance.
Is that true?
Slatkin: So sorry, could you repeat that one more time?
man: Does the order of the statements in the WHERE clause...
Slatkin: Oh.
man: Affect the performance of the query?
Slatkin: Right. That's a good question.
Yeah, so the question, I guess,
is--does the--yeah.
Does the order of the equality filters
in the merge-join affect the performance?
I believe, yes, it does.
So if you--if you know that you have a certain property
that has a high correlation to the result set,
you should list that first.
I think that we don't-- we don't have histograms
and we don't have a query planner to use them,
so any hints you can give us
on the way to find your result set is--is useful.
And I think that setting the equality--
the order of the equality filters will do that.
I need to confirm that,
but that's a really good question.
man: It's not a big difference.
Slatkin: It's not a big difference?
So, yeah, Ryan implemented it along with Max,
and he says it doesn't make a big difference.
I guess BigTable scanning is very fast
so that could be a good explanation.
Any other questions? Yep.
man: So if you have, say, a list of objects
that each have a price,
and the price is dynamic,
but you want to be able to search on...
objects that have a price above a certain value.
Is it possible to do range queries
without having to build a bunch of fancy indexes behind them,
because the data might be changing?
Slatkin: Yeah, so...yeah.
So that's--yeah, so, you know,
if you have data that's changing,
and you don't want to have to have
indexes for range queries, how do you do it?
So this is--this is kind of the--
the answer to these kinds of questions
usually is precompute your views
at write time.
So if I have, you know,
an example would be an auction site.
Show me everything between the price of 0 and 5,
5 and 20, 20 and 100,
100 and 1,000.
You know that those are going to be your views.
When you update the price of an item,
add it--add that slice of data to your list property.
And then you can query just for set membership.
You don't have to do a range query.
And if you need to change your ranges later,
or add more ranges,
you just can write some relation index entities
to do that,
or you can update the ones you already have.
And our Task Queue API that we'll be talking about tomorrow
will greatly increase the-- your ability to do that.
But yeah, I mean, when you have code
to update that index,
just--just--you know,
re...just, you know,
change--change the way you're slicing your--your fuzziness.
That's what a fuzzy query is.
That's a really good example of one.
We got some negative questions. That's good.
[chuckles]
"Doesn't the relation index entity
suffer from the n+1 select problem?"
Who asked this question, and could you please come to a mic?
I don't know the n+1 select problem.
Anybody? You, okay.
Can you tell us--what is the n+1 select problem?
man: When you select-- when you select the...
the actual number of elements first,
and then it does a select for each of the elements.
If you go back to the slide...
Slatkin: Yeah.
man: There was a "for" loop order.
Slatkin: Oh.
So...
you're talking about...
relation index... so this one.
Is this what you're talking about?
man: Yes, this one. There's a "for" loop, right?
Slatkin: This loop. That's what you're talking about?
man: Yep.
Slatkin: So that loop is just going over
the index result set.
Those entities have already been...
So this is a list of keys, these indexes here.
They're not separate queries.
They are not--they don't require SELECTs.
man: And when you do the db.get of keys
you actually get everything at once.
Slatkin: Exactly. So...so you're doing one SELECT,
one set of BigTable scans-- actually, one BigTable scan
to get all the keys.
And then you're taking all those keys,
transforming them, and doing one, like, hash table lookup
to pull all those values back.
So you don't have-- there's no n+1 problem.
It's just--it's one query, one get.
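That read path can be sketched in plain Python. The storage layout and key format below are hypothetical stand-ins for the relation-index pattern: one scan over the child index entities for keys, one key transformation, one batch get for the parents, so there is no per-result select.

```python
# Primary storage, keyed by message key (stand-in for the datastore).
MESSAGES = {
    "msg1": {"body": "hello"},
    "msg2": {"body": "status report"},
}
# Each MessageIndex is a child of its message; its key encodes the parent.
MESSAGE_INDEXES = [
    {"key": "msg1/index", "receivers": ["alice", "bob"]},
    {"key": "msg2/index", "receivers": ["bob"]},
]

def messages_for(receiver):
    # Step 1: one keys-only scan over the index entities.
    index_keys = [ix["key"] for ix in MESSAGE_INDEXES
                  if receiver in ix["receivers"]]
    # Step 2: transform child keys into parent keys (strip the suffix).
    parent_keys = [k.rsplit("/", 1)[0] for k in index_keys]
    # Step 3: one batch get -- db.get(keys) on App Engine.
    return [MESSAGES[k] for k in parent_keys]

print([m["body"] for m in messages_for("bob")])
```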
Cool.
I have one more...
I think this one's a little out of scope,
so, yeah, I can answer this for you later.
This one: "You talked about avoiding list deserialization
"for--you talked about avoiding list deserialization for reads.
"What about for writes?
"Can we add a list value without deserializing
and serializing the entire list?"
So there's a trade-off here.
You could just write another message index entity.
That has a lot of costs associated with it.
So you're paying-- the trade-off--the trade-off is
storage space or CPU time.
You could write a new entity
with that index value,
a new relation index...
when you need a new index value.
But then that's a whole other entity
and it has all of its associated indexes
and costs of the keys and so on.
So you need to figure out that trade-off for yourself.
If you don't really care about storage space so much,
then that's a really good solution.
But storage is usually pretty cheap.
So I could see the trade-off.
But writes are also very infrequent,
so it could make sense to update things.
It would be awesome if we had a way
of just appending to a list property
without having to retrieve it.
That would be the ideal solution.
And then we wouldn't have to do any of this.
You'd just say, "Oh, yeah,
"append this value to a string list
"and don't even come back to me.
I just want you to do it."
And that-- that would be--
that would be one way to solve this also.
Yeah.
man: Yeah. What makes a relation index entity
better than just a join entity,
where you would have the reference to the message
and the reference to the user?
Slatkin: Right, so...
on the join entity-- so yeah.
So--so the question is, yeah.
What makes a relation index better than a join entity?
A relation index, effectively, is a join entity.
That's what it's doing.
It's serving the same purpose.
The reason that it's useful is that it's a child of the parent.
So if you saw...oh.
Right here.
The message index is a child of its parent.
So we can scan for message indexes very quickly,
get their keys, and then figure out what their parents are.
We can de-reference that pointer really quickly
in memory with a minimal amount of serialization.
If you do this the other way,
with just, like, one-to-many relationships,
or with queries, or...
yeah, with a standard, like, one-to-many relationship style.
Because we don't have joins,
you'd have to first do a query
for ten items,
and then do a query for each of those results
to actually find the references.
And that's--that's the problem.
So...so, yeah, MessageIndex
is essentially a join table.
It's a speed-up, like that.
Yeah.
Dredge: Hi, Ert Dredge from Universal Metaphor.
Do you have suggestions
when we're doing our initial data architecture
for helping to implement these suggestions
that might not be obvious from the talk itself?
Slatkin: Yeah, that's a great question.
So are there optimizations that you should make
when implementing your data
or modeling your data not captured here?
Well, there are a few things.
There are a few kind of hidden-- hidden parts of this
that--that we need to be more clear about.
For instance, how do you compute the cost of indexes?
And how do you-- like, storage costs.
And how do you compute the cost of...
also the datastore storage itself?
We're working on making this more clear to people
so they can actually kind of calculate their index--
their--the index size that their data would have.
And how to figure out
the overhead of having composite indexes
and list properties.
So that's something that would be best for you--
when you're modeling your data, you say,
"Oh, well, if I need this composite query,
that could explode," or that, you know,
"that's gonna require an extra row,"
and then that has a large cost.
Another thing to remember is that
in a standard SQL database,
primary key is usually just an integer
which has a very small amount of storage space.
With us, we also have I.D.s,
numerical I.D.s that are primary keys,
but our keys, like I said,
encode the full path to the element.
That includes the app I.D., the kind,
any child and parent relationships are in there,
finally down to the primary key.
And so, it's useful-- it sounds funny,
but it's useful to actually use kind names that are short
and to use key names that are short,
because it reduces the amount of index space
you have to write,
which is kind of a funny thing.
Another one that's kind of similar
is there's a lot of really interesting things
you can do around encoding.
So one of the ones I know-- I forget the name of it.
If someone remembers it, shout it out.
There's this type of encoding
where if you know something's going to be--
happen really frequently,
you should just--you should make its representation
be as short as possible.
There we go. Huffman encoding, yeah.
So if you can do that--
if you know that, you know,
like, you're gonna have a celebrity
and they're gonna be using your microblogging system
all the time,
or you have one node that's connected to everything,
you want its primary key to be very short.
Because its--its value is going to be
in a lot of different lists.
It's going to be in a lot of relationships.
It's going to have a lot of index rows.
And you want to minimize its size.
And so if you can use that kind of encoding technique
you can really minimize your storage space a lot.
Cool.
Over here.
man: Is there any way to do like queries
or full text search?
Slatkin: To do like queries or full text search.
So, yeah, so the best answer we have--
there--okay, three answers.
So the first answer is that we have
a "poor man's" full text search,
which is the Google App Engine ext search module
that Ryan Barrett wrote.
That just uses merge-join, like I've described here,
to do full text queries.
It doesn't have ranking and ordering,
and it has a bunch of other problems,
but it's really good for simple stuff.
And, like I said, it's good for a small result set.
So if you know that you're only gonna get
20...20 results, 100 results,
and then can sort them in memory, it's great.
In terms of full text search,
then you need something like a real full text search.
This is, you know... you need ranking
and you need all kinds of duplicate elimination
and all kinds of stuff.
This is something that we know our users want,
we're asked about all the time.
I don't have anything to announce
but we know you want it.
You know, "This is Google.
"They don't have full text search.
What's going on?" Yeah, we know.
[laughter] It's hard, actually.
So...
we're--we're--that's the best answer I can give you.
I think there's a number three, but I forgot what it was.
See if there was...
where did that--ah.
Is there anything else...
[indistinct chatter]
Oh, this one's putting it down.
"One of the reasons for you giving us list properties
"is to save storage space.
"Since Google is handling storage, why do we care?"
Billing is enabled.
You can store...yeah.
You can--you can store gigabytes of data.
It's not just one gig of free quota.
Quotas--you can get that quota really high now.
And it's pretty cheap.
So you can store a lot of data without even knowing it.
So...yeah, you know.
We try to instill the best practices
in our programming model, in our APIs.
But we can't always do that,
and so this is kind of a cautionary tale
to tell you to-- to think about storage
before you-- before you actually
rack up a big bill or something.
I think I'm out of questions, and...
so thanks a lot for coming.
Again, I really appreciate it. [applause]