share
interactive transcript
request transcript/captions
live captions
download
|
MyPlaylist
JEAN PAJEREK: --slide.
SPEAKER 1: I'm seeing it. Yes. OK, we are now recording. Thanks.
JEAN PAJEREK: Thank you, Sharon. And I hope that my able colleagues in the reporting core group will watch the chat for me and alert me if there are questions there. Thanks. Good morning, everybody. This is an SQL case study. And as Sharon mentioned, I am building on things that I talked about at our last session. This is for SQL beginners. I'm not assuming too much knowledge, just the knowledge of the last presentation that I gave. But we're going to build on things that we learned in that session.
Our goal for the session, in our case study, is a query that lists the bacrode number, enumeration, and chronology of items associated with a specific holdings record. It might interest you to know that this is something that actually came up in my library when I was talking to my colleague Jackie. And we were thinking, well, hey, can we make a query like that? Because we need that. That's what got the ball rolling on this case study.
Questions to consider before you start writing a query. What are the data points that you need? Which table or tables contain those data points? And will looking at this information in the folio user interface give us any clues to help figure this out? This is a screenshot of my little test case from the Inventory app. It's a serial with 116 item records attached to it.
If you want to look this up in the Inventory app, the instance HRID is this, 368384. And it is a Dutch publication. Looking at this in Inventory, you can see this little number here in the holdings accordion. It says 116. And that means that there's 116 item records associated with this holdings right here.
AUDIENCE: Jean--
JEAN PAJEREK: Yes?
AUDIENCE: May I pronounce the title of that-- [SPEAKING DUTCH].
JEAN PAJEREK: Thank you. I'm glad I didn't try.
AUDIENCE: [INAUDIBLE]
AUDIENCE: That's fabulous.
JEAN PAJEREK: It's lovely, thank you very much.
AUDIENCE: You're welcome.
JEAN PAJEREK: What I did was, I expanded this accordion. This is the little arrow that expands the accordion in the Inventory app. I did that, and you can see a list of the item records associated with this holdings. It includes the barcode number, and the enumeration, and the chronology, as long as well as some other data points. Now, we need to think about-- that's in the Inventory app. If we are looking at this in the tables, in the LDP, where would that same information be residing? This screenshot here is a partial list of the tables in the LDP. I'm looking at them in DBeaver.
And looking at this, we just looked at our holdings, and it seems like maybe this inventory holdings table might have some of the information that we need. Let's take a closer look at that table and see if that is, in fact, the one we need. This screenshot is showing the properties that are associated with the inventory holdings table. And this is something that you can do in DBeaver. If we have time, I'll show you how to do that. It shows you the data points in the table.
And we see, by looking at this list of properties, that there's really only one that has anything to do with items, and it's this one that says number of items. And that's not really one of the data points we need right now. We might eventually, in some other query, want number of items, but not for this one. We now need to ask the question, what other table or tables might contain the data points we need? Returning to the LDP, we see that there is a table called Inventory Items.
Could this table be the one that we need? Let's open it up. An examination of the properties of the Inventory Items table shows that this is the table with most of the data points we want. That's because the data points we want reside within Item Records. Here we are. We're in the Inventory Items table. It has the barcode, it has the chronology, and has enumeration. And those are some of the data points that we're looking for our query.
Let's open the SQL Editor in DBeaver and write a very simple query to retrieve our three data points from the Inventory Items table and run it. This is a Select query. I'm asking it to select those three data points that we know are in the table-- barcode, chronology, enumeration-- from the table that's called Inventory Items. Super simple. Let's see what happens. Do your results look something like this? I hope. What's missing from this? We needed some other kind of information from our report. I mean, so far, so good, but we're not done yet.
AUDIENCE: Title.
JEAN PAJEREK: What was that, Carol?
AUDIENCE: Title.
JEAN PAJEREK: Title. Well, this report is not going to give us the title.
AUDIENCE: No, no, no.
JEAN PAJEREK: We call our goal a query that lists the bar code number, enumeration, chronology of items associated with a specific holdings record-- not just any old holdings record, but a specific one. The problem is, our results make no reference to any specific holdings record. How do we bring in that data point? Let's take another look at the properties associated with the Inventory Items table. This is a list of at least some of the properties on that table, the data points on that table.
One of them is Holdings Record ID. This has some possibilities. Let's bring that in to our Select query. I just added it here. Barcode, chronology, enumeration, holdings record ID from that same table, Inventory Items. What happens is, we get that long alphanumeric ID string here, the holdings record ID. This is the UUID the universal unique ID. It's not that more user-friendly one that we would rather look at. At, least I-- speaking for myself, I would rather look at the HRID, the human readable ID, than this universal unique identifier, because this is a really complex number.
Suppose we want that more accessible and human-friendly holdings HRID in our result. How do we get that? Let's re-examine the properties associated with the Inventory Items table. Could this data point, called HRID, be what we're looking for? Caution, caution, caution. We know that in FOLIO, there are instance HRIDs, there are holdings HRIDs, and there are item HRIDs.
We also know, from our last session, that within LDP tables, the names of IDs are context-dependent. This means that within the table we've been looking at, which is called Inventory Items, the HRID data element refers to the item HRID, not the holdings HRID. This isn't the HRID we need for our query, it's the holdings one. I mean, it's the item one. It's not the holdings one. Any questions before I go on? No? OK.
In which table are we most likely to find the holdings HRID? I'll give you a hint. It's a table we've already looked at. Revisiting the properties associated with the inventory holdings table, we see the data element called HRID. This is the HRID for the holdings record, because we're in the Holdings table. Are people getting what I'm saying about this? OK. Joanne, thank you for nodding. I can see your face. Thanks for the feedback.
As it turns out, the Inventory Holdings table has a role to play in our query after all. Here's a little pop quiz. We need to connect the Inventory Items table to the Inventory Holdings table. To do that, we need to find a data point that the two tables have in common. In the Inventory Holdings table, the point of connection we need is called ID, i.e., it is the holdings record ID. What is the same data point called in the Inventory Items table? You can mentally think of that answer.
In the Inventory Items table, the holdings record ID is helpfully called Holdings Record ID. Yay, something that makes sense. To connect the two tables, we need to add a Join statement to our query, connecting them on the common data point. Just in case you are tempted to run this query right now, please don't. It can take a little bit of time to run, especially if a lot of people are running it all at one time. I just want to show you the changes I've made here. We've added the Join statement to connect the two tables using the holdings record ID as the match point.
Something else to notice here. We've got barcode, inventory holdings HRID. This is disambiguating which HRID we want by prepending this data source name, which is the Inventory Holdings table, to the name of the data element, which is simply HRID. We still want our enumeration and our chronology. We want our data points to come from Inventory Items but also, now, the Inventory Holdings table gets joined in. And the place that we're joining it is on this ID field.
So the Holdings Record in the inventory items table is going to be the same as the ID, which is going to be the holdings ID, in the Inventory Holdings table. This is getting kind of complicated. Are there any questions about this joining process, where we have the thing we know is going to be the common data point, but they're called different things in these different tables.
We need to tell the LDP and DBeaver which ones we're talking about, so we have to be very specific. This idea is from the Inventory Holdings table. That's the name of the table. That's the name of the data point. And then the holdings record ID is a unique name. It's unique data point from the Inventory Items table. And when we say equal sign, we're saying they have to be the same number. They have to match. That's the match point.
AUDIENCE: I have a quick question. It's more like something I'd like us to do in the future, as I need more reminders about when to use left join, when to just use join, when to use right join. I don't think we need to go into that now, but maybe in the future we could talk about the different types of joins and when you want to do which ones.
AUDIENCE: I'll make a note of that. I'll make a note of that.
JEAN PAJEREK: That's a really good question. And this is something we've talked about in our core group, because the last session I did, I used what's called an inner join, and I had the Venn diagram of the circles meeting in the middle. It just so happens, in this specific case, the left join and the inner join give the same result. And I have said to Sharon, I would like to know the difference too. And how do I know when to use which one? So yeah, we are talking about having a session that talks about that. Thanks, Laura. Anything else before we go on? OK. Anything else?
AUDIENCE: I just-- yeah. Since you're talking about the way these are different, it's the same data point, but they have different names. The only way I kept that straight is that, if I see a table and if it has a data point that doesn't have a prefix-- if it just says ID or if it just says number, whatever it says, it means it always belongs to that base table. And if it's in the Items table but if it's a holding data point, it will always say holdings something.
JEAN PAJEREK: That's exactly right, Vandana, thanks. In looking at the tables, once you become accustomed to looking at these tables, you'll notice that the first column is almost always an ID of some sort. And it belongs to whatever that table is referring to. On the Items table, that first column that says ID, that's the item ID. In the Holdings table, that first column that says ID, that's the holdings ID.
The results of that query look like this. There's a couple of problems with this result. Notice that one of our columns is called simply HRID. We know there's three kinds of HRIDs. Which HRID is it referring to? In our query, we selected the HRID associated with the holdings record. Going back here, see? It's the holdings HRID right there. But in this results display, it's completely unclear what HRID you're talking about right here. This is not ideal at all. But we can correct it.
The way to do that is to disambiguate the column header by using an alias. The alias is right here. We're saying to the system, we have this thing in the Inventory Holdings table. It's called HRID. But in my results, I want to call it holdings HRID so I can tell what the heck HRID is in there. Run it again with that alias, and the column label is more intelligible. It now says holdings HRID, just like I told it to. But there's still a problem.
This search result is actually unmanageably large. If you were to run that query, you would see it's a gigantic, gigantic result. We want to limit our results, if you remember, to the item records associated with a single pollings record. To do this, we need to add a criterion to our query by adding a "where" statement. Here's the "where" statement, right on the bottom. By adding the holdings HRID as a criterion in the "where" statement, we're able to limit the results to just the items linked to this specific holdings.
There it is. I'm just saying, look for the inventory holdings HRID. And this is the literal value. That's the literal HRID for that holdings record. We execute that query. Let's check the accuracy of the result. Recall that our test title, which I dare not pronounce, has 116 item records attached to it. We can feel confident in our query results because it also has 116 Results lines. Even so, this looks really random, this display of results. It's, like, helter skelter, everywhere.
Wouldn't it be nice if we could put the results into some kind of order that made sense to a human being? Yes, it would be good. Considering this data that we're working with, the best way to order the query results would be by either the enumeration or the chronology data. This is serial issues. To do this, we use an "order by" statement. And I've just tacked that onto the end. It's important for people to know that the "order by" statement has to come after the "where" statement.
And if we were to reverse the order of these statements in the query, we'd get a syntax error. And DBeaver isn't always the most illuminating in telling you exactly what you did wrong. So just take my word for it that the "order by" has to be after the "where" statement. The search result is now nicely ordered by year in ascending order. The earliest dates are first, and then they increase as we go down the list.
We can reverse the sort order by adding "desc" for descending order to the end of the "order by" statement. The default-- if you're doing "order by," the default is understood as ascending order. But if you want it to be descending, with the most recent things on top, you can add this little qualifier here. And here's how that looks. The new items are right on the top. Any questions? OK. Wow, stunned silence. OK.
Things to keep in mind. Query writing is an iterative process of trial and error, testing, verifying accuracy, and refining parameters. Before starting to write a query, identify the data elements you want to include in the results. Once the data elements are identified, look for the tables where those data elements reside. Having a known test case like ours to work with is a good way to verify your results. Let me just change my share for a second. I wanted to show people how we go about looking at those properties in the tables. Here's our tables. The first one we looked at was the Inventory Holdings table.
AUDIENCE: Jean, we're not seeing your--
JEAN PAJEREK: Oh, sorry. I have to reshare. Sorry. Thank you for telling me. OK, share and-- where did it go? There. Are you seeing it now? This is DBeaver.
AUDIENCE: And Mira has a question. Her question is, how would we query if we have, for example, 20 HRIDs? OK, that's a-- you would have to do a different statement where it's in. And I can send you the statement and how you would put it in DBeaver. You would have a whole series. You could list all 20 of the HRIDs that you want. And it could be any one of them.
JEAN PAJEREK: Yeah, you can absolutely do that. But we're doing a thing for beginners right now, Mira, so we're trying to keep it simple. Thanks, Vandana. Here I am. I'm in the LDP for Cornell. Probably not visible to you is the teeny, teeny, tiny, tiny green checkmark which says, this is the database I am connected to currently. I've opened the little hierarchy of things here until I got to the public tables. And that's where I found the Inventory Holdings table that we looked at first. It's right here. I'm going to double-click on that and open it up.
Notice the tabs right here, properties, data, and ER diagram. It opened up, by default, in the Properties tab. This is where I could see the data points that belong in this table. And I think we found that there was, like, number of items or something. Yeah, there's the number of items. And that's the only thing that says anything about items in this table.
The properties tell you the data points that are in the table. And if you were to click on this tab up here, this is table data. Then it gives you the actual-- eventually, gives you the actual table with the data points filled in. I didn't know that was going to take so long, but there you have it. Wow. OK, there it is. Source was invalidated.
Yeah, the reason that took a while is because I logged into DBeaver for earlier this morning, and it was lying unused for a while. And so the data source was invalidated. But it comes back eventually if you run a query on it, as long as you've got that little green checkmark that says you're connected. OK, here's what we see. And something for people to know. I don't-- well, this is not really right. I don't know what this version one is. What does that mean? Does anyone know what that means? Let's see.
AUDIENCE: I only know what version means in the context of source records.
JEAN PAJEREK: Yes, source bibliographic-type records.
AUDIENCE: Yeah. So I don't know what version is elsewhere.
JEAN PAJEREK: Maybe holdings records also have versions. These are all version one so far. Apart from that, the version-- and this is the-- sorry, this is the ID number and the version. The rest of the fields are in kind of alphabetical order, pretty much. If you know the name of the data point you're looking for, you can generally scroll across and find it-- sorry, my sharing tools are blocking. OK, there we go. See, across the top, these are in alphabetical order, pretty much.
And then except at the end, where we have this fearsome thing called data-- it's somewhat scary. This one's not scary, though. This is something that we're going to have to talk about on a different occasion, because this is a more advanced topic. But anyway, the main thing is, I wanted people to be able to see how we see which data points can be found in which table. The inventory items one that we use, this is where I found that it has the bar code and the chronology in there. And the enumeration is in there. This is where it's easy to see which data points are in a given table.
I think that's about all I have, unless people have questions. I mean, I do have these queries up, and I can run them if you want to see them run live. Or if people are interested in me putting the SQL into the chat, I can do that. Does anyone care about that? I can do it if you want. Yes, no? It's going to be in the slide, too. But of course, it's easier to copy if it's in the chat.
AUDIENCE: How do you then save it if you find what you want and you think that you're going to use it again?
JEAN PAJEREK: Ah, OK. I'm positioning my cursor up here in the upper pane, where the query is. I'm going to do Control-S to save it. But, Carol, if I were just writing this and I just was typing along, and I ran it to see if it worked and I wanted to save it, if I then went to close it-- here's the little Close box. It would say, do you want to save it?
AUDIENCE: OK, great.
JEAN PAJEREK: It's not going to let you go away from there without saving it. The other thing is that it's going to save it saying something like "Script 17," something really, really meaningless like that. So here's what I do. I do Control-S. I just do Control-S. and then if I wanted to change the name of it, position my cursor in the tab right here, right-click, and there is an option here to rename the file. Or you can also do it with Control-F2. And that way, you can give it a meaningful name that you can get back to more easily.
AUDIENCE: Perfect.
JEAN PAJEREK: OK. Any other questions?
AUDIENCE: There's a question just about, when you're looking at DBeaver and you're looking at a table, the little numbers, 23G and-- that was answered in the chat. Thank you, Vandana. It's the size of the table--
JEAN PAJEREK: Oh, these numbers here. Yeah.
AUDIENCE: --not the number of records, but the size, so it's just showing you large and then--
JEAN PAJEREK: Yeah, this is a biggie. 23 gigabytes, that's a big table.
AUDIENCE: Yes. We have a lot of data at Cornell. We have a large library. And our data transfer overnight takes a long time. Basically, we're between 12 and 14 hours, depending on things. We have new technology coming that will be close to real-time data transfer, and so completely different approach. That will be much faster for loading. But yeah, we have quite a bit of data, and it's just going to get bigger.
JEAN PAJEREK: OK, that's what I have for today. And Sharon will put up the recording and the slides so people can refer back to it if they want to. Thanks, Sharon.
AUDIENCE: Thank you so much, Jean. That was very helpful.
This presentation is targeted toward those who are new to SQL (Structured Query Language). The instructor walks you through the process of finding the correct data elements and building a query that will be run against FOLIO data in the Cornell Library's FOLIO LDP (Library Data Platform) reporting database. If you missed the "My First SQL Query" presentation, we recommend that you review this recording ahead of time, as this second demo builds on the concepts presented in the first one. My First SQL Query https://vod.video.cornell.edu/media/My%20First%20SQL%20Query/1_sm0tsmqz